Currently we have an in-house MongoDB CDC process that generates JSON files and writes them to S3. We also have an initial snapshot of MongoDB dumped as Parquet on S3. An end-of-day (EOD) Spark job merges the CDC JSON files with the snapshot Parquet files and writes a new set of Parquet files back to S3.
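For context, the merge logic is conceptually along the lines of the sketch below (heavily simplified; the bucket paths and the _id/op/ts field names are placeholders, and it assumes each CDC event carries the full post-image of the document):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object SnapshotMergeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("eod-snapshot-merge").getOrCreate()

    val snapshot = spark.read.parquet("s3://my-bucket/snapshot/current/") // placeholder path
    val cdc      = spark.read.json("s3://my-bucket/cdc/dt=2024-01-01/")   // placeholder path

    // Keep only the latest CDC event per document _id (assumes each event carries
    // the full post-image of the document plus an op type and a timestamp ts).
    val latestCdc = cdc
      .withColumn("rn", row_number().over(Window.partitionBy("_id").orderBy(col("ts").desc)))
      .filter(col("rn") === 1)
      .drop("rn")

    // Drop snapshot rows that have a newer CDC event, then add back the latest
    // non-deleted versions of those documents.
    val merged = snapshot
      .join(latestCdc.select("_id"), Seq("_id"), "left_anti")
      .unionByName(latestCdc.filter(col("op") =!= "delete").drop("op", "ts"))

    merged.write.mode("overwrite").parquet("s3://my-bucket/snapshot/next/")
    spark.stop()
  }
}
```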
Is it a good idea to read the CDC files from S3 and keep applying them into RocksDB, with periodic merges/compactions? Can Spark read a DataFrame from RocksDB, so that the downstream business use cases can continue to run unchanged?
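Roughly what I have in mind is something like the sketch below, using the rocksdbjni Java API (key = Mongo _id, value = latest document as JSON; all paths and names are placeholders, and I realize the read-back side iterates on a single node since RocksDB is embedded, which is part of what I'm unsure about):

```scala
import org.rocksdb.{Options, RocksDB}
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer

object RocksCdcSketch {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()
    val db = RocksDB.open(new Options().setCreateIfMissing(true), "/mnt/cdc-rocksdb")

    // 1) Apply CDC events: key = Mongo _id, value = latest full document as JSON.
    //    (These would come from the CDC JSON files on S3; hard-coded here for illustration.)
    val cdcEvents = Seq(("id-1", """{"_id":"id-1","status":"active"}"""))
    cdcEvents.foreach { case (id, doc) => db.put(id.getBytes("UTF-8"), doc.getBytes("UTF-8")) }
    db.compactRange() // periodic manual compaction

    // 2) Read everything back and hand it to Spark as a DataFrame.
    //    RocksDB is embedded, so this scan happens on a single node (the driver here).
    val spark = SparkSession.builder.appName("rocksdb-read").master("local[*]").getOrCreate()
    import spark.implicits._
    val docs = ArrayBuffer[String]()
    val it = db.newIterator()
    it.seekToFirst()
    while (it.isValid) { docs += new String(it.value(), "UTF-8"); it.next() }
    it.close()
    val df = spark.read.json(spark.createDataset(docs))
    df.show()

    db.close()
    spark.stop()
  }
}
```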
We have a 10-node m5.2xlarge cluster running 24x7, and about 50% of the cluster time goes into the snapshot-merge job, so the actual business use cases get very little time. We have tried different SQL queries and different ways of partitioning the data to reduce the time taken by the snapshot-merge job, but it didn't help.
I am thinking of moving the merge out of the Spark cluster (and hence also reducing the AWS cost).