Deduplication and grouping for an events table at scale

I’m working with an events table that receives writes triggered by several source tables, with entity_id and payload columns. A message relay service then publishes these events to a Kafka topic. The table is partitioned hourly on event_time and takes roughly 5M+ rows per hour. Once a row has been processed and published, we mark it processed=true, and we drop whole partitions after 24 hours to avoid the performance cost of deleting individual rows.
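To make the retention scheme concrete, here is a minimal Python sketch of the partition bookkeeping described above. It only models the time arithmetic, not the database or the relay. The `events_YYYYMMDD_HH` naming scheme and the helper names are hypothetical, and I am assuming "after 24 hours" means the partition's whole hour is at least 24 hours old. At ~5M rows per hour, that leaves on the order of 120M rows live at any time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a partition is dropped once its entire hour is >= 24h old.
RETENTION = timedelta(hours=24)
PARTITION_WIDTH = timedelta(hours=1)


def partition_start(event_time: datetime) -> datetime:
    """Truncate a timestamp to the start of its hourly partition."""
    return event_time.replace(minute=0, second=0, microsecond=0)


def partition_name(event_time: datetime) -> str:
    """Hypothetical naming scheme: events_YYYYMMDD_HH."""
    return partition_start(event_time).strftime("events_%Y%m%d_%H")


def droppable_partitions(partition_starts, now: datetime):
    """Return the partition starts whose full hour lies before the cutoff."""
    cutoff = now - RETENTION
    return sorted(
        start for start in partition_starts
        if start + PARTITION_WIDTH <= cutoff
    )


def retained_rows(rows_per_hour: int, hours: int = 24) -> int:
    """Rough number of rows live under the retention window."""
    return rows_per_hour * hours
```

For example, with `now` at 12:30 on a given day and partitions covering the previous 30 hours, `droppable_partitions` returns only the hours that ended at or before 12:30 the day before; everything newer stays queryable by the relay.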