Tag Archive for python, opencv, pdf, image-processing, pdf2image

Extracting Labelled Diagrams from a Scanned PDF Such as a Question Paper

Sample image of a page of the PDF
I need to extract the images or diagrams from a scanned PDF using Python, where there are no clear boundaries between the images and the text. For the text I can use OCR, but for the diagrams, libraries like PyMuPDF and PDFMiner do not work. I have had some success with pdf2image and OpenCV, and with Google Cloud Vision, but the results are still not fully accurate: sometimes the same diagram is extracted several times in separate pieces, and sometimes a diagram is not extracted at all. Every online tool I tried extracts the entire page as a single image, because the PDF is a scanned copy of a printed question paper. I would appreciate any help.