I have data that I partition heavily. When running historical loads spanning several years, this can reach 100,000+ partitions.
The problem is that I load the data into S3 in dynamic partition overwrite mode.
The EMRFS S3-optimized committer is enabled (it is on by default in Glue 4.0, which I use).
Still, it takes forever to write the data to S3.
The Spark job does the transformations themselves fairly quickly: some joins, aggregates, and simple transforms.
The only pain point in the processing itself is a repartition(part_cols) before writing, so that each partition gets exactly one file (see the sketch below).
Everything I mentioned above takes 20-30 minutes on a 10-15 worker cluster.
The write to S3, however, takes forever (1 hour to 4 hours when several years of history are involved), **and meanwhile the workers are useless, with CPU usage at roughly 0%.**
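For context, the write path is set up roughly like this (a minimal sketch; the partition columns, sample data, and output path are placeholders for the real job):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partitioned-historical-load")
    # overwrite only the partitions present in this run, not the whole table
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

part_cols = ["year", "month"]  # placeholder partition columns

# Stand-in for the result of the joins/aggregates/simple transforms
df = spark.createDataFrame(
    [(2021, 1, 10.0), (2021, 2, 20.0), (2022, 1, 30.0)],
    ["year", "month", "amount"],
)

(
    df
    # hash-partition by the partition columns so all rows of a given
    # partition land in one task -> exactly one output file per partition
    .repartition(*part_cols)
    .write
    .mode("overwrite")
    .partitionBy(*part_cols)
    .parquet("s3://my-bucket/my-table/")  # placeholder output path
)
```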
Do you have any hints on how to save on DPU costs?
I know that on HDFS such writes would be roughly O(1) (a rename), but on S3 it has to:
- list all the objects and keys under the target prefixes,
- check whether each file already exists,
- delete it if it does,
- put the new file.
And as I said, the workers sit idle during all of this while still incurring costs.
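Just to illustrate why this scales so badly with partition count, the per-partition work amounts to something like the following (a conceptual boto3 sketch of the pattern, not the actual committer code; bucket, prefix, and file names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder

def overwrite_partition(prefix: str, local_file: str, key: str) -> None:
    """Conceptual per-partition overwrite: list, delete existing objects, put new file."""
    # 1. List every object under the partition prefix (paginated API calls).
    paginator = s3.get_paginator("list_objects_v2")
    existing = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        existing.extend(obj["Key"] for obj in page.get("Contents", []))

    # 2. Delete whatever is already there (the "dynamic overwrite" part).
    #    Real code would batch deletes in chunks of at most 1000 keys.
    if existing:
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in existing]},
        )

    # 3. Upload the freshly written file.
    s3.upload_file(local_file, bucket, key)

# With 100,000+ partitions, a sequence like this repeats 100,000+ times at
# commit time, which is consistent with the executors sitting idle meanwhile.
```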
Is there, as of today, a solution to this problem?
PS: I also tried the custom committers (magic and partitioned),
but performance is similar to the EMRFS S3-optimized committer, which is really slow with that many partitions.
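For reference, this is roughly how I enabled them (a sketch using the standard S3A committer configuration keys; it assumes the spark-hadoop-cloud bindings are on the classpath, and the exact keys may differ on Glue):

```python
from pyspark.sql import SparkSession

# Sketch of the committer settings used for those tests.
committer_confs = {
    # Bind Spark's commit protocol to the Hadoop PathOutputCommitter machinery.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    # "magic" for the magic committer, "partitioned" for the partitioned one.
    "spark.hadoop.fs.s3a.committer.name": "magic",
    "spark.hadoop.fs.s3a.committer.magic.enabled": "true",
    # For the partitioned committer: replace conflicting partitions on commit.
    "spark.hadoop.fs.s3a.committer.staging.conflict-mode": "replace",
}

builder = SparkSession.builder.appName("historical-load-committer-test")
for key, value in committer_confs.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```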
Thanks for your help