I have a large dataset stored in a table with columns entity_id, time, and group_id. I’m trying to efficiently retrieve unique records based on entity_id and time.
CREATE TABLE test_table (
entity_id int,
time BIGINT,
group_id INT
);
INSERT INTO test_table (entity_id, time, group_id) VALUES
(1, 1709049203345, 10086),
(2, 1708605859234, 435146),
(2, 1708605859234, 435146),
(4, 1707815681971, 10086),
(5, 1707291871198, 10086),
(5, 1707291871198, 10086);
entity_id | time | group_id |
---|---|---|
1 | 1709049203345 | 10086 |
2 | 1708605859234 | 435146 |
2 | 1708605859234 | 435146 |
4 | 1707815681971 | 10086 |
5 | 1707291871198 | 10086 |
5 | 1707291871198 | 10086 |
I’ve attempted using both a CTE
with ROW_NUMBER()
and the DISTINCT ON
clause, but both approaches are taking too long to process.
WITH cte AS (
SELECT
entity_id,
time,
group_id,
ROW_NUMBER() OVER (PARTITION BY entity_id, time ORDER BY entity_id) AS row_num
FROM
test_table
)
SELECT
entity_id,
time,
group_id
FROM
cte
WHERE
row_num = 1;
and
SELECT DISTINCT ON (time, entity_id) *
FROM test_table
ORDER BY time, entity_id;
Can anyone suggest a more efficient method for achieving this?
I don’t have write access to this database, I can only make SELECT statment.
db fiddle: https://www.db-fiddle.com/f/cQ6eELDAzHbr1nGK3iHBjK/0