I’m working with Spark and encountering an unexpected sort operation during a join of two pre-sorted and bucketed tables. Both tables have been created with the same number of buckets and are sorted by the join key. However, when I perform the join operation, Spark still includes a sort step in the execution plan.
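For context, both tables were created along these lines (the column types and bucket count here are placeholders, not my exact schema):

CREATE TABLE tableA (joinKey BIGINT, value STRING)
USING PARQUET
CLUSTERED BY (joinKey) SORTED BY (joinKey) INTO 8 BUCKETS;

CREATE TABLE tableB (joinKey BIGINT, other STRING)
USING PARQUET
CLUSTERED BY (joinKey) SORTED BY (joinKey) INTO 8 BUCKETS;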
Here’s a simplified version of my Spark SQL query:
SELECT a.*, b.*
FROM tableA a
JOIN tableB b ON a.joinKey = b.joinKey
Both tableA and tableB are bucketed and sorted by joinKey. Despite this, the execution plan includes sort operations:
*(1) Sort [joinKey#1 ASC NULLS FIRST], false, 0
...
*(2) Sort [joinKey#2 ASC NULLS FIRST], false, 0
I would expect Spark to leverage the existing bucketing and sort order for a sort-merge join without re-sorting the inputs. This is particularly surprising because there is no data skew and the tables are well distributed.
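In case it helps, bucketing is enabled in my session, and the plan above comes from a plain EXPLAIN. This is roughly how I am checking (output omitted):

-- returns the current value of the bucketing flag
SET spark.sql.sources.bucketing.enabled;

-- the physical plan shown above comes from this
EXPLAIN
SELECT a.*, b.*
FROM tableA a
JOIN tableB b ON a.joinKey = b.joinKey;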