I’m building a RAG project. One of the steps is extracting text from PDF files. It works well when the input PDF quality is good, but when the quality is poor, my RAG fails to answer some of the questions. How can I improve the pipeline when I have to deal with low-quality PDF files?
I used PyMuPDF to extract text directly from PDFs.
I used Chroma as my vector database.
I used BAAI/bge-m3 as my embedding model.
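
For context, here is a minimal sketch of what the extraction step looks like, assuming the plain `page.get_text()` call from PyMuPDF. The helper names `extract_pages` and `pages_with_little_text`, and the `min_chars` threshold of 50, are illustrative assumptions rather than my exact code. The second helper is only a rough way to spot pages where direct extraction returns almost nothing, which is one symptom I would expect from poor-quality input.

```python
def extract_pages(pdf_path):
    """Return the plain text of each page, read from the PDF's text layer."""
    # PyMuPDF is imported lazily so the helper below can be used without it.
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def pages_with_little_text(page_texts, min_chars=50):
    """Return 0-based indices of pages whose extracted text is nearly empty.

    min_chars is an arbitrary, illustrative threshold (an assumption).
    """
    return [
        i for i, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]
```

Calling `pages_with_little_text(extract_pages("some.pdf"))` (the file name is a placeholder) would list the pages where direct extraction returned little or no text, which helps narrow down where the failures come from.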