I'm struggling a bit with deduplication and I was wondering if you had an idea on how we could solve this:
- We have a `SharedMergeTree` table that stores events (1B+ rows). We currently have this PK: `(event, flowGroup, flowId, timestamp)`. It contains other columns as well, including a unique `id` (a simplified sketch of the schema follows the `INSERT` example below).
- The `timestamp` column is calculated, so when we receive multiple requests for the same event, `timestamp` can change. That means we can't use a `ReplacingMergeTree` with this sort key, because it would only deduplicate when the reinsert comes from a retry, not from a different request. The PK also needs to be a prefix of the sort key, so we can't keep the timestamp only in the PK.
- Because materialized views are just insert triggers, we can't use them to deduplicate by having an input `ReplacingMergeTree` table with just the `id` in the sort key and a materialized view that writes the data out to another table with the desired PK.
- We tried using a subquery in the `INSERT` statement to check whether the data already exists. It works logically, but we hit performance issues:
insert into tracks
select
'xxx' as event,
'yyy' as flowGroup,
'zzz' as flowId,
'2021-01-01 00:00:00.123' as timestamp,
'2021-01-01 00:00:00.012' as originalTimestamp,
'1234' as id
where not exists (
select 1 from tracks
where
-- Narrow down by the PK before checking the id
event = 'xxx'
and flowGroup = 'yyy'
and flowId = 'zzz'
-- Timestamp can change, but will always be equal or greater than originalTimestamp
and timestamp >= '2021-01-01 00:00:00.012'
and id = '1234'
)
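To make the discussion concrete, here's a simplified sketch of the table (types are simplified and the other payload columns are omitted):

```sql
CREATE TABLE tracks
(
    event             LowCardinality(String),
    flowGroup         String,
    flowId            String,
    timestamp         DateTime64(3),  -- calculated; can differ between requests for the same event
    originalTimestamp DateTime64(3),  -- timestamp is always >= originalTimestamp
    id                String          -- unique per event
    -- ... other payload columns
)
ENGINE = SharedMergeTree
ORDER BY (event, flowGroup, flowId, timestamp);
```

With this layout, the `NOT EXISTS` subquery has to scan the whole `timestamp >= originalTimestamp` range within the `(event, flowGroup, flowId)` prefix, which can cover a lot of granules; I assume that's where the performance issues come from.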
I considered storing just the `id` in a separate table and doing a lookup by ID, but my understanding is that the performance of selecting unique IDs from huge tables isn't great either. Maybe a table with sort key `(event, flowGroup, flowId, id)` would make the lookup faster? It could still contain a large number of rows for a single combination of `event, flowGroup, flowId`, though.
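Concretely, I'm imagining something like this (table name and types are placeholders):

```sql
CREATE TABLE tracks_ids
(
    event     LowCardinality(String),
    flowGroup String,
    flowId    String,
    id        String
)
ENGINE = ReplacingMergeTree  -- full sort key == row identity, so merges dedupe the lookup table itself
ORDER BY (event, flowGroup, flowId, id);

INSERT INTO tracks
SELECT
    'xxx' AS event,
    'yyy' AS flowGroup,
    'zzz' AS flowId,
    '2021-01-01 00:00:00.123' AS timestamp,
    '2021-01-01 00:00:00.012' AS originalTimestamp,
    '1234' AS id
WHERE NOT EXISTS (
    -- Matches the full sort key of tracks_ids: a point lookup on the
    -- primary index instead of a timestamp range scan on tracks
    SELECT 1 FROM tracks_ids
    WHERE event = 'xxx'
      AND flowGroup = 'yyy'
      AND flowId = 'zzz'
      AND id = '1234'
);
```

The lookup table could even be populated by a materialized view on `tracks`, since for that direction the insert-trigger behavior is exactly what we need.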
We ingest a decent amount of duplicates (~416k duplicated rows out of ~24.5M ingested events over one week), so I'm pretty sure mitigating that would improve performance quite a bit, as well as reduce storage. It isn't a problem for query correctness, since we group the data anyway, so duplicates aren't counted.
Thank you!