I have a PDF of mixed type: some pages are scanned and others are digital. I want to classify this PDF as “Mixed” and record the scanned page numbers so that I can pass only those pages to Azure AI Document Intelligence (formerly Form Recognizer).
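One possible way to frame the per-page check, as a rough sketch rather than a tested solution: treat a page as scanned when it has almost no extractable text or when a single image covers most of the page. The function names and the thresholds (`min_chars=50`, `min_coverage=0.8`) are assumptions chosen for illustration and would need tuning on real documents. `page_stats` relies on PyMuPDF's `page.get_text()` and `page.get_image_info()`.

```python
from typing import List, Tuple


def classify_page(text_chars: int, image_coverage: float,
                  min_chars: int = 50, min_coverage: float = 0.8) -> str:
    """Label one page from its text length and largest-image coverage (0..1)."""
    full_page_image = image_coverage >= min_coverage
    has_text = text_chars >= min_chars
    if full_page_image and has_text:
        return "scanned_ocr"   # looks scanned but carries a text layer
    if full_page_image or not has_text:
        return "scanned"       # note: blank or vector-only pages also land here
    return "digital"


def classify_document(labels: List[str]) -> str:
    """Return 'Digital', 'Scanned', or 'Mixed' for a list of page labels."""
    if not labels:
        raise ValueError("document has no pages")
    digital = [label == "digital" for label in labels]
    if all(digital):
        return "Digital"
    if not any(digital):
        return "Scanned"
    return "Mixed"


def scanned_page_numbers(labels: List[str]) -> List[int]:
    """1-based numbers of pages that are not plain digital pages."""
    return [i for i, label in enumerate(labels, start=1) if label != "digital"]


def page_stats(pdf_path: str) -> List[Tuple[int, float]]:
    """Collect (text length, largest image coverage) per page via PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    stats = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text_chars = len(page.get_text().strip())
            page_area = page.rect.get_area()
            coverage = 0.0
            for info in page.get_image_info():
                visible = fitz.Rect(info["bbox"]) & page.rect
                if page_area > 0:
                    coverage = max(coverage, visible.get_area() / page_area)
            stats.append((text_chars, coverage))
    return stats
```

With these helpers, `labels = [classify_page(t, c) for t, c in page_stats("file.pdf")]` (the file name is a placeholder) gives the per-page labels, and `scanned_page_numbers(labels)` gives the pages to send for OCR.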
There is also a peculiar case I came across in another PDF: it appears to be scanned, but when I read it with fitz (PyMuPDF) a text layer exists, so text can be extracted. The extracted text is not exactly garbage, but it is not entirely sensible either.
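For this second case, one heuristic that might help is to score how "word-like" the extracted text is and re-run OCR on pages that score low. The sketch below is only an illustration: `text_quality`, `needs_ocr`, and the `0.6` threshold are hypothetical, and the "plausible word" rule (purely alphabetic, contains a vowel, at most 20 characters, or purely numeric) is a crude assumption that suits English-like text only.

```python
import re

_VOWEL = re.compile(r"[aeiouyAEIOUY]")
_PUNCT = ".,;:!?()[]\"'"


def text_quality(text: str) -> float:
    """Fraction of whitespace-separated tokens that look like words or numbers."""
    tokens = text.split()
    if not tokens:
        return 0.0
    plausible = 0
    for token in tokens:
        word = token.strip(_PUNCT)
        if word.isdigit():
            plausible += 1
        elif word.isalpha() and len(word) <= 20 and _VOWEL.search(word):
            plausible += 1
    return plausible / len(tokens)


def needs_ocr(text: str, threshold: float = 0.6) -> bool:
    """Flag a page whose text layer scores below the (assumed) threshold."""
    return text_quality(text) < threshold
```

A page's text from `page.get_text()` could be passed to `needs_ocr`, and flagged pages added to the list sent for OCR. A dictionary or language-model check would likely be more reliable than this vowel rule.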
Is there a method to solve the problems described above? I cannot share the documents because they are confidential, but I am happy to explain further if required. My code is purely in Python.