I have Python code that reads the contents of various PDF files, both scanned and text-based, using pdfminer. The code looks like this:
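```
import os
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

with open(os.path.join(pdf_directory, file_name), 'rb') as file:
    output_string = StringIO()
    parser = PDFParser(file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
    text = output_string.getvalue()
    if len(text) == 0:
        # If a PDF mixes scanned and text-based pages, this approach needs to change
        extract_ocr(pdf_directory, file_name, text_directory)
```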
Here `extract_ocr` is a separate function that runs OCR on scanned PDFs. The code fails for scanned files: the extracted text is effectively empty, yet the `len(text) == 0` condition is not satisfied because `len(text) != 0`. The same condition worked earlier when I was using PyPDF2 instead of pdfminer. Any suggestions on how to handle this case?
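One thing I have not verified against my files: as far as I can tell, pdfminer's `TextConverter` writes a form feed character after each page, so a scanned PDF could yield a string made only of whitespace rather than an empty one. Under that assumption, a whitespace-insensitive check might look like the sketch below. The helper names `is_effectively_empty` and `blank_pages` are made up for illustration and are not part of pdfminer.

```python
def is_effectively_empty(text: str) -> bool:
    """Return True when text contains nothing but whitespace.

    str.strip() with no arguments also removes form feed characters,
    so a string made only of page separators counts as empty.
    """
    return not text.strip()


def blank_pages(text: str) -> list:
    """Return 0-based indices of pages whose text is only whitespace.

    Assumption: pages in the extracted text are separated by form feeds.
    """
    pages = text.split("\x0c")
    # A trailing separator produces one extra empty chunk; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return [i for i, page in enumerate(pages) if not page.strip()]
```

With this, the condition would become `if is_effectively_empty(text):`, and `blank_pages(text)` could point at the individual pages that need OCR when a file mixes scanned and text-based pages.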