I have a script that parses a large collection of PDFs. When it reaches one particular PDF, the script hangs indefinitely. It doesn’t raise an error, and as far as I can tell the PDF is not corrupted. I can’t tell what the issue is, but I can see that the hang happens on page 4. Is there a way to find out what is causing this, or to simply skip a PDF if it takes longer than one minute to parse?
For reference, here is the PDF: https://go.boarddocs.com/fl/palmbeach/Board.nsf/files/CTWGW9459021/$file/22C-001R_2ND%20RENEWAL%20CONTRACT_TERRACON.pdf
from PyPDF2 import PdfReader

doc = "somefile.pdf"
doc_text = ""
try:
    print(doc)
    reader = PdfReader(doc)
    for i in range(len(reader.pages)):
        print(i)
        page = reader.pages[i]
        text = page.extract_text()
        doc_text += text
except Exception as e:
    print(f"The file failed due to error {e}:")
    doc_text = ""
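
For the "skip after one minute" part, here is a minimal sketch of one possible approach, not a confirmed fix: run the extraction in a separate Python process and give up when it exceeds a time limit. It assumes PyPDF2 is installed for the same interpreter. The names `EXTRACT_SCRIPT`, `run_python_with_timeout`, and `extract_text_with_timeout` are hypothetical helpers made up for illustration; only `subprocess.run` with its `timeout` argument and `PdfReader` are real APIs. A try/except alone cannot help here because a hang never raises, so the work has to live in a process that can be killed from outside.

```python
import subprocess
import sys

# Script run in the child process: extracts text from the PDF in argv[1].
# (Hypothetical helper; it mirrors the loop in the question.)
EXTRACT_SCRIPT = """
import sys
from PyPDF2 import PdfReader

reader = PdfReader(sys.argv[1])
sys.stdout.write("".join(page.extract_text() or "" for page in reader.pages))
"""


def run_python_with_timeout(code, args=(), timeout=60):
    """Run `code` in a fresh interpreter.

    Returns the child's stdout, or None if it timed out or exited with an error.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-X", "utf8", "-c", code, *args],
            capture_output=True,
            encoding="utf-8",
            timeout=timeout,  # the child is killed once this many seconds pass
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout


def extract_text_with_timeout(path, timeout=60):
    """Return the PDF's text, or "" if parsing failed or took too long."""
    return run_python_with_timeout(EXTRACT_SCRIPT, [path], timeout) or ""
```

With this, `doc_text = extract_text_with_timeout("somefile.pdf")` would come back as an empty string after about a minute for the problematic file, and the loop over the rest of the collection could continue. The trade-off is the cost of starting one interpreter per PDF, and it does not explain why page 4 hangs in the first place.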