I have a Spark job that reads from an Oracle data source and writes to an Iceberg table.
Multiple queries run concurrently in separate threads, and each query writes to exactly one partition (in Iceberg).
The code to insert looks like:
df.writeTo("catalog.db.iceberg_table").overwritePartitions()
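For context, here is a simplified sketch of what each thread does (in Scala; the connection options, table names, and filter column are placeholders, not my real values):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("oracle-to-iceberg")
      .getOrCreate()

    // Each thread runs one query like this, reading only the rows that
    // belong to a single Iceberg partition (placeholder query and connection).
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
      .option("query", "SELECT * FROM src_table WHERE part_col = 'P1'")
      .option("user", "user")
      .option("password", "password")
      .load()

    // DataFrameWriterV2 dynamic partition overwrite: this is the write
    // whose query plan shows the extra shuffle.
    df.writeTo("catalog.db.iceberg_table").overwritePartitions()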
In the query plan this write triggers a shuffle of the data, which always results in 27 empty partitions and 1 partition holding all the data (out of 28, the Spark parallelism).
Is there a way to avoid this shuffle?
I have not found a solution, as the shuffle seems to be triggered by the DataFrameWriterV2 API.
I am using Spark 3.3.