Efficient Storage Strategy for Intermediate Text Data in a Data Processing Pipeline
I am developing a RAG (Retrieval-Augmented Generation) application that scrapes approximately 10,000 online articles to power a chatbot. The workflow is: scrape the articles, add metadata, split the text into chunks and embed them, store the embeddings in a vector database, and run queries against it. I need advice on the best intermediate storage solution for the articles between the scraping and metadata annotation stages.
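To make the stage in question concrete, here is a minimal sketch of one possible option: a single SQLite file sitting between the scraper and the annotation step. It is only an illustration, not a recommendation. The table layout and the helper names (`open_store`, `save_article`, `unannotated`, `annotate`) are assumptions made up for this example and are not part of the actual application.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the intermediate store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               url      TEXT PRIMARY KEY,  -- one row per scraped article
               body     TEXT NOT NULL,     -- raw article text
               metadata TEXT               -- JSON string, NULL until annotated
           )"""
    )
    return conn


def save_article(conn, url, body):
    """Scraping stage: store an article; a URL that already exists is skipped."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO articles (url, body) VALUES (?, ?)",
            (url, body),
        )


def unannotated(conn):
    """Annotation stage: list (url, body) pairs that have no metadata yet."""
    return conn.execute(
        "SELECT url, body FROM articles WHERE metadata IS NULL ORDER BY url"
    ).fetchall()


def annotate(conn, url, metadata):
    """Annotation stage: attach a metadata dict to an article as JSON."""
    with conn:
        conn.execute(
            "UPDATE articles SET metadata = ? WHERE url = ?",
            (json.dumps(metadata), url),
        )
```

With something like this, the scraper would call `save_article` for each page, and the annotation step would loop over `unannotated(conn)` and call `annotate` for each row, so either stage could be stopped and rerun without redoing finished work.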
Best way to store lots of text data
I’m building a RAG application that will scrape ~10k articles from the internet and use those for a chatbot.