I am currently developing software that extracts the important data from a business pitch deck (TL;DR version). Part of this involves getting text out of a PDF. That part is simple enough, but keeping the data LLM-readable is not.
A little about the PDFs:
They are PDF exports of slideshows (the original slideshow files are unobtainable), 5-35 pages long. Text formatting is all over the place, and important data is often inside tables. Font sizes vary greatly within a single PDF. Images are also scattered throughout the PDF, sometimes with text in them (text that is part of an image is irrelevant, but text placed on top of an image could be important).
I am open to trying a different language if it has a library that could be useful.
APIs with associated costs are fine.
Processing one PDF should take under 10 minutes, ideally under 2 minutes.
Thank you for any help you may have, even a small breadcrumb might help.
Here is a list of a few things I have tried:
PyMuPDF (fitz) – the extracted text did not keep tables in an LLM-readable form
One other PyMuPDF alternative – same problem: the extracted text did not keep tables in an LLM-readable form
Breaking the PDF down into images of 1-4 pages each and sending them to GPT-4o and Claude 3.5 Sonnet one image at a time – the text in some PDFs was too small for the models to read accurately (for anyone wondering, GPT-4o was closer)
Pytesseract – super inaccurate
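
To make "LLM readable" concrete for the table case: what I am after is roughly a Markdown table per extracted table, so that rows and columns stay aligned in the prompt. Below is a minimal sketch, assuming the rows have already been extracted as lists of cell strings by some library. The `rows_to_markdown` helper is hypothetical (it is not part of PyMuPDF or any other package), and the sample values are made up for illustration.

```python
def rows_to_markdown(rows):
    """Render extracted table rows (lists of cells) as a Markdown pipe table.

    Assumption: the first row is the header. Cells may be None or contain
    line breaks, which is common in text extracted from slide PDFs.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Collapse internal whitespace/newlines and escape pipe characters
        # so that one cell cannot break the table layout.
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    # Pad short rows so every row has the same number of columns.
    norm = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]

    lines = ["| " + " | ".join(norm[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in norm[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the output.
    sample = [["Metric", "2023"], ["Total\nrevenue", "$1.2M"], ["Churn", None]]
    print(rows_to_markdown(sample))
```

The problem with the tools above is getting clean rows in the first place; once the rows exist, a conversion like this is the easy part.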