I am trying to generate a randomly sorted DataFrame on Databricks.
My go-to approach is to call .orderBy(rand())
on the DataFrame.
This, however, quite often triggers a SparkException:

SparkException: Job aborted due to stage failure: Total size of serialized results of 36 tasks (1038.9 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
After experimenting a bit, I see that the error goes away if I decrease the number of partitions, e.g. with repartition(10)
or something similar. This is counterintuitive to me: I would have expected that decreasing the number of partitions increases the size of each chunk traveling over the network.
Could anybody explain why this isn’t the case?
import org.apache.spark.sql.functions.rand

// BOOM: throws the SparkException quoted above
display(my_df
  .orderBy(rand()))
// WORKS FINE with the extra repartition
display(my_df
  .repartition(10)
  .orderBy(rand()))
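In case it is relevant, here is a minimal sketch (same Scala/Databricks setup) of how I checked the partition count and the driver-side limit the exception refers to; my_df and spark are the DataFrame and notebook SparkSession from above:

// Sketch: inspect partition counts and the driver result-size limit
println(my_df.rdd.getNumPartitions)                  // 662 in my case
println(my_df.repartition(10).rdd.getNumPartitions)  // 10 after the repartition
println(spark.sparkContext.getConf.get("spark.driver.maxResultSize", "<not set>"))  // 1g here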
P.S.: the original number of partitions in my DataFrame is 662.
P.S.2: Running Spark 3.3.0 with Scala 2.12 on Databricks!