I’m trying to find a general RAG solution for problems involving text, images, charts, tables, etc., spread across many different formats such as .docx, .xlsx, and .pdf.
The requirements for the answers:
- Some answers are just images
- Some answers contain only text and must be absolutely accurate, because they relate to a process, etc.
- Other answers don’t need to be absolutely accurate but should still be logically consistent; this is something I am already working on
The features of the documents:
- Some documents in DOCX and Excel format contain only text; this is the simplest case. My task there is to choose the embedding model and LLM, select hyperparameters such as chunk size and chunk overlap, and experiment to find appropriate values
- For documents with more complex content, such as DOCX files containing text and images, or PDF files containing text, images, charts, tables, etc., I haven’t found a general way to handle them yet.
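For the text-only case, the loop I have in mind (chunk, embed, retrieve, then let the LLM answer from the retrieved context) looks roughly like the sketch below; `chunk_size` and `overlap` are exactly the hyperparameters I would sweep. Note the hashing "embedding" is only a deterministic toy stand-in so the sketch is self-contained; a real system would call an actual embedding model there:

```python
# Minimal sketch of the text-only RAG path: fixed-size character chunking
# with overlap, plus cosine-similarity retrieval. The bag-of-words hashing
# "embedding" is a placeholder, NOT a real embedding model.
import math
import zlib
from collections import Counter

def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text, dim=64):
    """Toy deterministic bag-of-words vector; swap in a real model."""
    vec = [0.0] * dim
    for tok, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(tok.encode()) % dim] += count
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

The idea would be to hold this skeleton fixed and grid-search `chunk_size`/`overlap` (and the embedding model) against a small evaluation set of question–answer pairs.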
Below are some resources I have read, but I don’t feel I fully understand them, and I’m not sure how they can help me.
- https://medium.com/kx-systems/guide-to-multimodal-rag-for-images-and-text-10dab36e3117
- https://blog.langchain.dev/semi-structured-multi-modal-rag/
I want to outline a pipeline that can answer questions according to my system’s requirements. Any help would be greatly appreciated!
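As far as I can tell, both posts above converge on the same pattern for mixed content: partition each file into elements (text, table, image), have a vision-capable model write a text summary of each non-text element, index only the summaries, and keep a pointer back to the raw element so a retrieved image can be returned as-is (the "answer is just an image" requirement). A hedged sketch of that pattern, where `summarize()` stands in for a VLM call and keyword overlap stands in for embedding similarity:

```python
# Sketch of the summarize-and-index pattern for mixed-content documents.
# Everything model-related is a placeholder: `summarize` stands in for a
# VLM, and `score` stands in for embedding similarity.
from dataclasses import dataclass

@dataclass
class Element:
    kind: str       # "text", "table", or "image"
    content: str    # text body, table as text, or an image caption stand-in
    raw: str = ""   # pointer to the raw asset, e.g. an image file path
    summary: str = ""

def summarize(el: Element) -> str:
    """Placeholder: a real system would ask a VLM to describe the element."""
    return el.content

def build_index(elements):
    """Attach a text summary to every element; index these summaries."""
    for el in elements:
        el.summary = summarize(el)
    return elements

def answer(query, index):
    """Retrieve the best element; images are returned verbatim by path."""
    def score(el):  # toy keyword overlap; use embedding similarity in practice
        q = set(query.lower().split())
        return len(q & set(el.summary.lower().split())) / max(len(q), 1)
    best = max(index, key=score)
    if best.kind == "image":
        return ("image", best.raw)   # the "answer is just an image" case
    return ("text", best.content)
```

The key design point is that retrieval happens over uniform text (summaries), while the returned payload can still be the original image or table, which keeps one index working across all the formats.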
System:
- The LLM runs locally (Llama 3.1 13B Instruct, Qwen2-7B-Instruct, …)