I'm working with an events table that several source tables write into via triggers, recording an entity_id and a payload. A message relay service then publishes these events to a Kafka topic. The table is partitioned hourly on event_time and handles a high volume, roughly 5M+ rows per hour. After a row is processed and published we mark it processed = true, and partitions are dropped after 24 hours to avoid the performance hit of deleting individual rows.
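For context, each source table writes into events with a trigger roughly like this (simplified sketch; the function and trigger names, and the orders table, are just placeholders):

    CREATE OR REPLACE FUNCTION enqueue_event() RETURNS trigger AS $$
    BEGIN
        -- Record which table fired the trigger, the row's primary key,
        -- and the full row as the payload.
        INSERT INTO events (entity_type, entity_id, payload)
        VALUES (TG_TABLE_NAME, NEW.id::text, to_json(NEW));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_enqueue_event
        AFTER INSERT OR UPDATE ON orders
        FOR EACH ROW EXECUTE FUNCTION enqueue_event();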
I need to add deduplication to this table. Specifically, if an unprocessed row with the same entity_id and entity_type already exists, I want to update its payload instead of inserting another event.
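From what I understand, INSERT ... ON CONFLICT with a partial unique index on (entity_type, entity_id) WHERE processed = false isn't straightforward here, because a unique index on a partitioned table has to include the partition key (event_time). The best I've come up with is an update-then-insert in one statement, roughly like the sketch below (with $1 = entity_type, $2 = entity_id, $3 = payload), but I'm worried about the extra index it would need and about races between concurrent writers:

    WITH upd AS (
        UPDATE events
           SET payload = $3::json
         WHERE entity_type = $1
           AND entity_id   = $2
           AND processed   = false
        RETURNING id
    )
    INSERT INTO events (entity_type, entity_id, payload)
    SELECT $1, $2, $3::json
    WHERE NOT EXISTS (SELECT 1 FROM upd);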
Also, I am currently fetching rows with: select * from events where processed = false order by event_time asc limit 1000; Is there a way to fetch them grouped by entity_type?
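By grouping I mean something along these lines (just to illustrate the intent; I don't know whether it can be kept efficient at this volume):

    SELECT *
      FROM events
     WHERE processed = false
     ORDER BY entity_type, event_time
     LIMIT 1000;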
Is there an efficient way to achieve both deduplication at insert time and grouping by entity_type at fetch time, at this scale?
Table:
CREATE TABLE events (
    id          bigserial   NOT NULL,
    entity_type text        NOT NULL,  -- source table identifier
    entity_id   text        NOT NULL,  -- source table pk
    event_time  timestamptz DEFAULT current_timestamp,
    payload     json,
    processed   boolean     DEFAULT false,
    PRIMARY KEY (id, event_time)
) PARTITION BY RANGE (event_time);
Stack: Postgres (database), pg_partman (partition management), Java (relay service).