If I am wrong in my understanding, please feel free to correct me.
In my project, we have a time-series database. It is set up as a 3-node Patroni cluster (one leader, two read replicas). Each node is an AWS EC2 instance, and the time-series data is stored in hypertables provided by the TimescaleDB extension on a Postgres database. We are using open-source TimescaleDB here.
As the data grows each day, the EBS data volume on each node (EC2 instance) is expected to hit its size limit in the future. Hence the need for sharding.
As a potential solution, we looked at distributed hypertables in TimescaleDB. But that appears to be a dead end, since multi-node support (on top of which distributed hypertables are built) has been deprecated in open-source TimescaleDB.
Another option is Citus (a Postgres extension that implements sharding). But Citus doesn't support the TimescaleDB extension, so as a high-level solution we would have to convert the TimescaleDB hypertables to regular Postgres tables first in order to use Citus. So far, Citus seems to be the most suitable (relatively speaking) choice for implementing sharding.
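For illustration, here is a minimal sketch of what we understand that conversion to involve, using a placeholder `metrics` hypertable (the table and column names are hypothetical, not our actual schema):

```sql
-- Placeholder hypertable: metrics(time timestamptz, device_id int, value float8).
-- Copy the hypertable's rows into a plain Postgres table;
-- TimescaleDB decompresses compressed chunks transparently on read.
CREATE TABLE metrics_plain (LIKE metrics INCLUDING DEFAULTS INCLUDING CONSTRAINTS);
INSERT INTO metrics_plain SELECT * FROM metrics;

-- After verifying row counts, drop the hypertable and rename the plain table:
-- DROP TABLE metrics;
-- ALTER TABLE metrics_plain RENAME TO metrics;
```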
Could someone please suggest a better way?
Edit note: Data archival or purging is not an option for us; all of the data is needed. Compression has already been applied as much as possible. This has bought us some additional time before the storage limit of the EBS data volume is reached, but eventually sharding will be required.
I'd like to share some guidance on using Citus for time-series datasets. If you are OK with converting your hypertables to vanilla Postgres tables, here are some of your options:
- Create partitions and shards
- Scale out (i.e., distribute shards across worker nodes)
- Automate partition creation with pg_cron
- Archive frozen shards/partitions using columnar storage
More details are in the documentation at https://docs.citusdata.com/en/stable/use_cases/timeseries.html, and rough sketches of each step are shown below.
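As an illustration of the first two points (partitioning, sharding, and scaling out), here is a minimal sketch assuming a hypothetical `metrics` table distributed on `device_id` and range-partitioned by `time`; the table, column, and worker host names are assumptions, not from your setup:

```sql
-- Enable Citus on the coordinator (the extension must also exist on the workers).
CREATE EXTENSION IF NOT EXISTS citus;

-- Register the worker nodes so shards can be placed on them (scale out).
SELECT citus_add_node('worker-1', 5432);
SELECT citus_add_node('worker-2', 5432);

-- A time-partitioned parent table using native Postgres range partitioning.
CREATE TABLE metrics (
    time      timestamptz NOT NULL,
    device_id int         NOT NULL,
    value     float8
) PARTITION BY RANGE (time);

-- Shard the parent table across the workers on device_id;
-- every time partition is then sharded the same way.
SELECT create_distributed_table('metrics', 'device_id');
```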
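For automating partition creation, the linked docs combine pg_cron with Citus's create_time_partitions() helper; a sketch along those lines (the job name, schedule, and intervals are just examples):

```sql
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Once a day, create any missing daily partitions up to a month ahead.
SELECT cron.schedule('create-partitions', '@daily', $$
  SELECT create_time_partitions(
      table_name         := 'metrics',
      partition_interval := '1 day',
      end_at             := now() + interval '1 month'
  )
$$);
```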
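And for archiving frozen partitions, the docs show converting partitions past a certain age to Citus columnar storage; again a hedged sketch with example values:

```sql
-- Once a month, switch partitions older than three months to columnar
-- storage, which compresses them while keeping them queryable.
SELECT cron.schedule('compress-partitions', '@monthly', $$
  CALL alter_old_partitions_set_access_method(
      'metrics',
      now() - interval '3 months',  /* older_than */
      'columnar'
  )
$$);
```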