I am learning about Spark and came across the following behaviour. I create a simple DataFrame using `range`:
ids = spark.range(1, 9, 1, 4)  # start=1, end=9 (exclusive), step=1, numPartitions=4
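As a sanity check (a sketch of my own; `spark` is assumed to be an ordinary `SparkSession`), the initial partition count can be read straight off the underlying RDD:

```python
ids.rdd.getNumPartitions()  # -> 4, matching the (4) prefix in the debug string below
```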
Then I repartition this DataFrame on an expression and build a new one as follows.
from pyspark.sql.functions import col
q = ids.repartition(col("id") % 2)
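Note that `repartition(expr)` without an explicit partition count shuffles into `spark.sql.shuffle.partitions` partitions before any AQE coalescing. A quick way to confirm the configured value (a sketch I am adding, not part of the original run):

```python
spark.conf.get("spark.sql.shuffle.partitions")  # '200' unless overridden
```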
Now, if I run `toDebugString` on the new DataFrame's RDD (decoding the result, since PySpark returns it as bytes):

print(q.rdd.toDebugString().decode())
Output -
(1) MapPartitionsRDD[16] at javaToPython at NativeMethodAccessorImpl.java:0 []
| MapPartitionsRDD[15] at javaToPython at NativeMethodAccessorImpl.java:0 []
| SQLExecutionRDD[14] at javaToPython at NativeMethodAccessorImpl.java:0 []
| ShuffledRowRDD[13] at at LexicalThreadLocal.scala:63 []
+-(4) MapPartitionsRDD[12] at at LexicalThreadLocal.scala:63 [] <-- This RDD
| MapPartitionsRDD[11] at at LexicalThreadLocal.scala:63 []
| MapPartitionsRDD[10] at at LexicalThreadLocal.scala:63 []
| ParallelCollectionRDD[9] at at LexicalThreadLocal.scala:63 []
Now, observe the `MapPartitionsRDD[12]` (the one just before `ShuffledRowRDD`) in the Spark UI that is generated by running `q.show()`. Here the number of partitions is 200 (set by `spark.sql.shuffle.partitions`). The numbers in the names of the `MapPartitionsRDD`s in the photo below differ from the output above, since the two were produced by separate commands.
After this exchange task, the number of partitions is reduced to 1 for the `ShuffledRowRDD`. This can be verified in the UI as well.
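It can also be verified programmatically, and toggling AQE exposes the un-coalesced count (a hedged sketch of my own; `q2` is a fresh DataFrame built after changing the config):

```python
q.rdd.getNumPartitions()  # -> 1 with AQE enabled (the coalesced ShuffledRowRDD)

spark.conf.set("spark.sql.adaptive.enabled", "false")
q2 = ids.repartition(col("id") % 2)
q2.rdd.getNumPartitions()  # -> 200, i.e. spark.sql.shuffle.partitions
```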
If `MapPartitionsRDD` has 4 partitions and `ShuffledRowRDD` has 1 partition, where did the 200 partitions go?
This is my reasoning after gathering all the information I could.
- If `MapPartitionsRDD` had 200 partitions, it would have been kept in a separate stage in the UI, since each stage ends at a wide transformation. Also, its name should then have been different, not `MapPartitionsRDD`.
- The `ShuffledRowRDD` has 1 partition because of the coalescing done by Spark's adaptive query execution.
- So, the partitioning and shuffling happened in between the two stages, and the number of partitions shown in the Spark UI (the ones in the black boxes) is not related to the partitions of the RDD (see the sketch after this list).
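A minimal way to see the 200 in one place is the physical plan: the `Exchange` is planned with 200 partitions even though AQE later coalesces the read (the plan text below is illustrative and varies by Spark version):

```python
q.explain()
# With AQE enabled, the plan contains something like:
#   AQEShuffleRead coalesced
#   +- Exchange hashpartitioning((id % 2), 200), REPARTITION_BY_COL
#      +- Range (1, 9, step=1, splits=4)
```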
Are the above reasons correct? If so, then what is the use of `Exchange` in the Spark UI? Also, what does the number of partitions in the Spark UI represent?