I have a Databricks workflow that creates several entities (each represented as one task in the workflow). For some of them, due to performance issues, I tuned the shuffle size by hand by changing spark.sql.shuffle.partitions. When such a task finishes, I revert shuffle.partitions to the default value (200), as sketched below.
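To make the setup concrete, this is roughly the pattern inside such a task (a minimal sketch; `build_entity` and the value 800 are only illustrative):

```python
from pyspark.sql import SparkSession

# In a Databricks task, `spark` is already defined; getOrCreate() simply reuses it.
spark = SparkSession.builder.getOrCreate()

def build_entity():
    """Hypothetical placeholder for this task's read -> transform -> write."""
    pass

# Hand-tuned value for this entity; 200 is the Spark default.
spark.conf.set("spark.sql.shuffle.partitions", "800")
try:
    build_entity()
finally:
    # Revert so the default applies again once this task is done.
    spark.conf.set("spark.sql.shuffle.partitions", "200")
```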
Maybe someone can help me understand what generally happens when one of the tasks, running in parallel with other tasks, changes the Spark configuration for shuffle partitions:

- To be precise: what happens to other entities (still using the default of 200 shuffle partitions) that, let's say, already started processing a few seconds or minutes ago?
- What would be the best approach in such a situation? Should tasks with different shuffle partition values be isolated in the workflow?
Additional description:
- Each Databricks task follows the same logic:
  - Read data from source X.
  - Transform.
  - Write.

  So in each Databricks task there is only one action (see the sketch after this list).
- Each Databricks task starts at a different time, based on its action.
- Some actions with the default shuffle.partitions (200) will start processing before the Databricks task with the shuffle size changed to 800 starts processing.
- Everything happens in the same cluster.
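For illustration, one of the hand-tuned tasks looks roughly like this (the table names and the aggregation are made up; the only action is the final write):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # reuses the existing Databricks session

# Hand-tuned shuffle size for this particular entity (other tasks keep the default 200).
spark.conf.set("spark.sql.shuffle.partitions", "800")

df = spark.read.table("source_db.entity_x_raw")   # hypothetical source table
transformed = (
    df.groupBy("business_key")                     # shuffle-inducing transform
      .agg(F.sum("amount").alias("total_amount"))
)

# The write below is the single action that triggers the Spark job for this task.
transformed.write.mode("overwrite").saveAsTable("target_db.entity_x")

# Revert to the default once the task's action has finished.
spark.conf.set("spark.sql.shuffle.partitions", "200")
```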
So the questions:
- How will such a change of the shuffle partitions (for the whole cluster) affect already-running Spark jobs/tasks that started with 200 partitions?
- Will it change to 800 for all already-initiated Spark jobs/tasks? If so, won't it somehow affect those processes before they finish?