In my CTE, i find out the duplicates. Then, through the SELECT statement and row_number, I decided which rows I want to delete.
WITH duplicates AS (
SELECT hash_key, load_date, count(*) AS cnt
FROM data_vault.crm
GROUP BY hash_key, load_date
HAVING cnt > 1
)
SELECT d.*,
ROW_NUMBER() OVER (PARTITION BY d.hash_key, d.load_date ORDER BY d.sent_at nulls first) AS row_num
FROM data_vault.crm d
INNER JOIN duplicates dup ON d.hash_key = dup.hash_key AND d.load_date = dup.load_date
Now, I want to delete all rows where row_num = 1. I want to delete rows from the original table data_vault.crm
How can I achieve this in Redshift SQL.
Before deleting, I would also like to count the total number of rows that would be deleted.