I have a job running on Spark 2.4.4 on an HDFS cluster. It reads two sources, joins them, and does some processing of the data. The job is quite large; shuffle partitions (spark.sql.shuffle.partitions) are set to 10000.
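Simplified, the job looks roughly like this (paths, column names, and the aggregation here are placeholders, not the real ones):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SkewedJoinJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skewed-join-job")
      .config("spark.sql.shuffle.partitions", "10000") // as in the real job
      .getOrCreate()

    // Placeholder paths/columns; the real sources are much larger and wider.
    val left  = spark.read.parquet("hdfs:///data/source_a")
    val right = spark.read.parquet("hdfs:///data/source_b")

    // Join the two sources and do some processing on top (simplified).
    val result = left.join(right, Seq("join_key"))
      .groupBy("join_key")
      .agg(count(lit(1)).as("cnt"), sum("amount").as("total"))

    result.write.mode("overwrite").parquet("hdfs:///data/output")
    spark.stop()
  }
}
```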
The job gets stuck on the last task of one of the stages.
First scenario:
24/07/30 07:20:00 WARN expressions.RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/07/30 07:20:00 WARN expressions.RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
24/07/30 07:20:01 INFO memory.TaskMemoryManager: Memory used in task 158596
24/07/30 07:20:01 INFO memory.TaskMemoryManager: Acquired by org.apache.spark.unsafe.map.BytesToBytesMap@c437ba3: 14.6 GB
24/07/30 07:20:01 INFO memory.TaskMemoryManager: Acquired by org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@d451207: 64.0 MB
24/07/30 07:20:01 INFO memory.TaskMemoryManager: Acquired by org.apache.spark.sql.catalyst.expressions.FixedLengthRowBasedKeyValueBatch@78ce3e2b: 64.0 MB
24/07/30 07:20:01 INFO memory.TaskMemoryManager: Acquired by org.apache.spark.sql.catalyst.expressions.VariableLengthRowBasedKeyValueBatch@5f53250: 64.0 MB
24/07/30 07:20:01 INFO memory.TaskMemoryManager: Acquired by org.apache.spark.unsafe.map.BytesToBytesMap@2085c96d: 256.0 KB
24/07/30 07:20:01 INFO memory.TaskMemoryManager: Acquired by org.apache.spark.shuffle.sort.ShuffleExternalSorter@714737fd: 32.0 KB
24/07/30 07:20:01 INFO memory.TaskMemoryManager: Acquired by org.apache.spark.unsafe.map.BytesToBytesMap@30fe0610: 256.0 KB
24/07/30 07:20:01 INFO memory.TaskMemoryManager: 0 bytes of memory were used by task 158596 but are not associated with specific consumers
24/07/30 07:20:01 INFO memory.TaskMemoryManager: 15838248960 bytes of memory are used for execution and 114520823 bytes of memory are used for storage
24/07/30 07:20:07 INFO sort.UnsafeExternalSorter: Thread 100 spilling sort data of 2.0 GB to disk (0 time so far)
There is some data skew, which causes the spill to disk. But the task just loops through these messages and nothing progresses.
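For what it's worth, this is the kind of check that shows the skew (reusing the placeholder DataFrame and column name from the sketch above):

```scala
import org.apache.spark.sql.functions.desc

// Rows per join key, heaviest keys first (placeholder column name).
left.groupBy("join_key")
  .count()
  .orderBy(desc("count"))
  .show(20, truncate = false)
```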
Second scenario: I tried fewer executors. In that case there was no spill to disk, but there were also no logs at all, and the job stopped doing anything at the same point.
What can cause that?
For that second run I reduced the number of executors and substantially increased the memory.
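Roughly, the configuration for that run was along these lines (the exact numbers here are placeholders, not the values I actually used; in practice the same properties can also be passed via spark-submit --conf):

```scala
import org.apache.spark.sql.SparkSession

// Placeholder figures: fewer executors than the original run, more memory each.
val spark = SparkSession.builder()
  .appName("skewed-join-job")
  .config("spark.sql.shuffle.partitions", "10000")
  .config("spark.executor.instances", "20")      // fewer executors
  .config("spark.executor.memory", "32g")        // substantially more memory per executor
  .config("spark.executor.memoryOverhead", "4g")
  .getOrCreate()
```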