I have a 2B+ row external table which I am reading with Redshift Spectrum. There are two relevant columns – key_col (BIGINT) and partition_col (VARCHAR). partition_col is a partition column, and there are ~50 partitions.
I need to find if there are any duplicate values in key_col, and if so to flag the partition they are in. If any do exist, they are guaranteed to be in the same partition.
Currently, this is being accomplished with
SELECT
COUNT(key_col) as occurences,
partition_col
FROM my_table
GROUP BY
partition_col,
key_col
HAVING
occurences > 1;
However, with the volume of data this can be really slow. Looking at the EXPLAIN plan, it seems like Redshift might not be optimizing to take advantage of the fact partition_col is a partition, and instead is just HashAggregate-ing all of the partition_col/key_col pairs across all partitions. I would think (?) it could be done by performing the HashAggregate within each partition, which would dramatically reduce the number of comparisons being made to complete the full query. It’s possible this is being done and just isn’t clear from the EXPLAIN plan, but regardless my question is:
How might this query be rewritten to explicitly take advantage of the partitioning (or is it not possible)?
I’ve tried a few approaches with LAG and ROW_NUMBER window functions, but while the logic is pretty straightforward both require sorting by key_col which is also extremely slow due to the volume.