I’m building a RAG application that will scrape ~10k articles from the internet and use those for a chatbot.
I’m wondering where to store the data in the interim.
The pipeline is essentially (rough sketch below):

1. Scrape the data
2. Add metadata
3. Chunk/Embed
4. Add to vector database
5. Run queries
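To make the steps concrete, here's roughly what I have in mind. The function bodies and library choices (requests for scraping, whatever chunker/embedder/vector store I end up with) are placeholders, not decisions:

```python
import requests

def scrape(urls: list[str]) -> list[dict]:
    # Step 1: fetch the raw article text for each URL
    return [{"url": u, "content": requests.get(u, timeout=30).text} for u in urls]

def add_metadata(articles: list[dict]) -> list[dict]:
    # Step 2: enrich each article (topic, source, scrape date, ...)
    for a in articles:
        a["metadata"] = {"source_url": a["url"]}
    return articles

def chunk_and_embed(articles: list[dict]) -> list[dict]:
    # Step 3: split article text into chunks and compute embeddings
    ...

def load_vector_db(chunks: list[dict]) -> None:
    # Step 4: upsert chunks + embeddings into the vector store
    ...

def answer_query(question: str) -> str:
    # Step 5: retrieve relevant chunks and answer with the chatbot
    ...
```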
My question is: where do I store the articles between step 1 and steps 2/3?
These are some ideas I considered:
- Throw it all into Postgres, articles and everything
- Throw it all into MongoDB and add the metadata as fields
- Store the articles in S3 (each source would have a single file containing all of its articles) and store the metadata in Postgres along with an ID for each article (probably the URL?); see the sketch after this list
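The third option would look roughly like this. Just a sketch assuming boto3 and psycopg2; the bucket name, table name, and schema are made up:

```python
import json
import boto3
import psycopg2

s3 = boto3.client("s3")
conn = psycopg2.connect("dbname=rag user=rag")

def store_source(source_name: str, articles: list[dict]) -> None:
    # One S3 object per source, containing all of that source's articles
    key = f"raw/{source_name}.json"
    s3.put_object(Bucket="my-article-bucket", Key=key, Body=json.dumps(articles))

    # One metadata row per article in Postgres, keyed by URL, pointing back at the S3 object
    with conn, conn.cursor() as cur:
        for a in articles:
            cur.execute(
                "INSERT INTO article_metadata (url, source, s3_key) VALUES (%s, %s, %s) "
                "ON CONFLICT (url) DO NOTHING",
                (a["url"], source_name, key),
            )
```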
The other thing worth noting is that I plan to run queries over the article contents to generate the metadata. For example, I want to pass the article contents to an LLM and have it return the topic of the article.
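This is the kind of call I mean for that metadata step. Sketch only, using the OpenAI client as an example; the model name and prompt are placeholders and I haven't picked a provider:

```python
from openai import OpenAI

client = OpenAI()

def extract_topic(article_text: str) -> str:
    # Ask the LLM for the main topic of a single article
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Reply with the single main topic of this article."},
            {"role": "user", "content": article_text[:8000]},  # truncate very long articles
        ],
    )
    return resp.choices[0].message.content.strip()
```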