I’m trying to extract specific data from multiple PDFs. I start by splitting the example image (Picture 1) into cells along its horizontal and vertical lines. I then crop each cell and run pytesseract OCR on it to extract the text, as shown in Picture 2.
Everything works fine until the text extraction step. In some cells the extraction works perfectly, but in others it fails. For example, in Picture 2, I want to extract both “PROJEKTNAMN” and “TRANSPORTGARAGET,” but only the latter is extracted. I suspect the difference in font size between the two is the cause. I’ve tried adjusting parameters such as oem and psm, but without any improvement.
Does anyone have any suggestions or solutions to help resolve this issue?
Picture 1
Picture 2
Thanks in advance!
Things I have tried so far:
- Changing the zoom of the cropped cells.
- Changing the oem and psm settings and setting the language to Swedish, with no better results.