I’m wondering what the most efficient way is to store, outside of the database, older data that I don’t need to access.
Context: I’m managing billions of rows of data in a single relational database on Amazon RDS PostgreSQL. More than 95% of the data is timestamp-and-value data points. The plan is to have rolling data storage, meaning we will keep, for example, 2 years of data easily accessible in the database, while anything older does not need to be accessible. Storage in the database is quite expensive, and keeping everything would just make the database grow bigger (and more expensive) over the years for nothing.
My big-picture idea of the solution is that every month, some routine would back up data points older than 2 years and transfer them somewhere cheaper, BUT WHERE, that is the question.
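For reference, the monthly routine I have in mind would look roughly like the sketch below. This is only a minimal illustration, assuming a hypothetical `data_points(ts, value)` table and S3 as the “somewhere cheaper” target; the table name, bucket, and connection details are all placeholders, not a final design.

```python
# Minimal sketch of a monthly archival job. Assumes a hypothetical table
# data_points(ts timestamptz, value double precision) and an S3 bucket
# "my-archive-bucket"; adjust names, credentials, and the destination.
import gzip
import io
from datetime import datetime, timedelta, timezone

import boto3
import psycopg2

# "Older than 2 years" cutoff for this run.
CUTOFF = datetime.now(timezone.utc) - timedelta(days=2 * 365)

conn = psycopg2.connect("dbname=mydb user=myuser host=my-rds-endpoint")
s3 = boto3.client("s3")

with conn, conn.cursor() as cur:
    # Stream the old rows out as gzipped CSV so the archive stays small and cheap.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        cur.copy_expert(
            f"COPY (SELECT * FROM data_points WHERE ts < '{CUTOFF.isoformat()}') "
            "TO STDOUT WITH CSV HEADER",
            gz,
        )
    buf.seek(0)
    key = f"archive/data_points/{CUTOFF:%Y-%m}.csv.gz"
    s3.upload_fileobj(buf, "my-archive-bucket", key)

    # Delete only after the upload succeeded, so a failed run never loses data.
    cur.execute("DELETE FROM data_points WHERE ts < %s", (CUTOFF,))

conn.close()
```

The open question is really the destination of that upload, not the job itself.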
I’ve seen and read about:
- Data Lake
- Data Warehouse
- Data Mart
Thing is, I honestly don’t know what fits my needs best. Whatever I read focuses on things like giving third parties easy access to monitor and perform analytics on the data for ML or AI, which is not useful for me (at least not for now). Some diagrams show architectures with 4+ different services to do XYZ. I don’t want to go overkill and want to keep this as simple as possible. I just need to store old data somewhere super cheap, and for now I don’t really need to access it after it’s backed up, in order to keep my database no bigger than it needs to be. Maybe in the future there will be a need to fetch it in bulk for ML or AI training, but there’s no plan for that in the near future.
Any ideas of what would be optimal?