I am working on a project for a publishing house that involves implementing semantic search across an archive of approximately 50,000 articles, each averaging 15 pages in length. My understanding is that a Retrieval-Augmented Generation (RAG) approach is the right way to achieve this.
Here are my specific requirements:
Indexing Documents: Convert documents into vector embeddings and store them in a way that allows for efficient retrieval.
Semantic Search: Perform searches based on user queries by converting the queries into embeddings and finding the most relevant documents.
Document Summarization: Summarize the content of the retrieved documents to present concise information to the user.
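To make the first two requirements concrete, here is a minimal sketch of indexing and searching with metadata kept next to each vector. It is an illustration under assumptions, not a proposed implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, `InMemoryIndex` is a hypothetical stand-in for a vector database, and the titles and `example.com` URLs are made up. It uses only the Python standard library.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 256  # toy dimensionality chosen for this sketch only


def toy_embed(text: str) -> list[float]:
    """Hypothetical placeholder for a real embedding model.

    Hashes each whitespace-separated token into one of DIM buckets and
    L2-normalises the result, so a dot product equals cosine similarity.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class Record:
    # Metadata stored alongside the vector so a hit can be traced to its source.
    doc_id: str
    title: str
    url: str
    vector: list[float]


class InMemoryIndex:
    """Hypothetical brute-force index; a real system would use a vector database."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, doc_id: str, title: str, url: str, text: str) -> None:
        # The same embedding function is used for documents and for queries.
        self.records.append(Record(doc_id, title, url, toy_embed(text)))

    def search(self, query: str, k: int = 3) -> list[tuple[float, Record]]:
        q = toy_embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, r.vector)), r) for r in self.records
        ]
        # Highest cosine similarity first; sort on the score only.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


index = InMemoryIndex()
index.add("a1", "Copyright law for publishers", "https://example.com/a1",
          "copyright law licensing publishers rights")
index.add("a2", "Baking sourdough bread", "https://example.com/a2",
          "sourdough bread flour baking oven")

for score, record in index.search("publishers copyright licensing", k=1):
    print(record.doc_id, record.title, record.url)
```

Each search result carries the document ID, title, and URL with it, which is the behavior needed when presenting results. A linear scan like this is only for illustration; at the scale of this archive an approximate nearest-neighbor index would presumably take its place.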
I would like to know:
What are the recommended tools or frameworks to implement RAG for this use case?
How can I store metadata (like document IDs, titles, URLs) alongside vector embeddings to identify the source documents when presenting search results?
What is the best approach to generate embeddings for both documents and user queries?
I have researched various tools and frameworks that might help, including vector databases and embedding models. I understand that storing metadata alongside the embeddings is crucial for identifying source documents. However, I am unsure which specific tools or frameworks would best suit my needs, and how to implement and scale this solution effectively.
I would appreciate any advice or suggestions on how to implement this effectively. Thank you!