I’m working on a system where:
- A producer sends approximately 100 million messages daily to a message queue.
- The consumer processes each message from the queue and produces multiple parts as output.
- Each part is published back to another queue, with all parts sharing a unique identifier linking them to the original message.
- Parts may arrive at different times, with delays of up to 4 hours after the original message.
- I need to aggregate all parts corresponding to a unique identifier before generating a final result and producing a report.
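To make the data model concrete, here is a minimal sketch of what a part message might look like. The field names (`correlation_id`, `part_index`, `total_parts`, `status`) are my own assumptions for illustration, not a fixed schema:

```python
from dataclasses import dataclass
from enum import Enum


class PartStatus(str, Enum):
    SUCCESS = "success"
    FAILED = "failed"


@dataclass(frozen=True)
class Part:
    correlation_id: str   # links the part back to its original message
    part_index: int       # position of this part within the message
    total_parts: int      # how many parts the original message was split into
    status: PartStatus    # outcome of processing this part
```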
Reporting Requirement:
For each original message, I need to produce a report that summarizes:
- Total Messages: The total number of original messages sent (e.g., 10 messages).
- Successful Messages: The number of messages where all parts were processed successfully.
- Failed Messages: The number of messages where at least one part failed.
For example, if I send 10 messages, each broken down into 10 parts (100 parts total), the report should look like this:
- Total Messages: 10
- Successful Messages: 7
- Failed Messages: 3
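The report logic itself is simple once parts are grouped: a message counts as successful only if every one of its parts succeeded. A minimal sketch, assuming parts have already been grouped by identifier into lists of boolean outcomes:

```python
def summarize(parts_by_message: dict[str, list[bool]]) -> dict[str, int]:
    """Count total, successful, and failed messages from per-part outcomes."""
    total = len(parts_by_message)
    # A message succeeds only if all of its parts succeeded.
    successful = sum(1 for outcomes in parts_by_message.values() if all(outcomes))
    return {
        "total_messages": total,
        "successful_messages": successful,
        "failed_messages": total - successful,
    }
```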
Problem:
- High volume: around 100 million original messages per day.
- Fan-out: the consumer splits each message into multiple parts, which are published asynchronously to another queue with up to a 4-hour delay between parts.
- Aggregation: all parts for a given identifier must be collected before the final result can be generated.
- Reporting: the final report must reflect the success or failure of each original message based on the status of its parts.
Questions:
- How can I efficiently store and manage these parts until all of them are received? What storage or caching solutions would work best given the volume and delay?
- What strategies can I use to detect when all parts for a given identifier have arrived? How can I reliably trigger the final aggregation and report generation?
- How can I ensure this approach scales effectively, particularly given the high message volume and time constraints?
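For completion detection, one pattern I have been considering is a counting aggregator: each part carries the expected total, and a store keyed by correlation identifier tracks how many parts have arrived. The in-memory dict below is only a stand-in for a shared store such as Redis, and the names and the expiry handling are assumptions for illustration:

```python
import time
from dataclasses import dataclass, field

FOUR_HOURS = 4 * 60 * 60  # parts for one message may straggle up to 4 hours


@dataclass
class PendingMessage:
    expected: int                 # total_parts declared on each part
    failed: bool = False          # becomes True if any part failed
    seen: int = 0                 # parts received so far
    first_seen: float = field(default_factory=time.monotonic)


class Aggregator:
    """In-memory stand-in for a shared store (e.g. a Redis hash + counter)."""

    def __init__(self) -> None:
        self._pending: dict[str, PendingMessage] = {}

    def record_part(self, correlation_id: str, total_parts: int, ok: bool):
        """Record one part; return the finished message's status, or None."""
        entry = self._pending.setdefault(correlation_id, PendingMessage(total_parts))
        entry.seen += 1
        entry.failed = entry.failed or not ok
        if entry.seen == entry.expected:       # last part has arrived
            self._pending.pop(correlation_id)
            return "failed" if entry.failed else "success"
        return None

    def expired(self) -> list[str]:
        """Identifiers whose parts never completed within the 4-hour window."""
        cutoff = time.monotonic() - FOUR_HOURS
        return [cid for cid, e in self._pending.items() if e.first_seen < cutoff]
```

In a real deployment the per-identifier counter would need to be atomic across consumer instances (e.g. Redis `INCR` plus a TTL matching the 4-hour window), and `expired()` would be driven by key expiry rather than a scan.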
I’m looking for recommendations on design patterns, tools, or frameworks that can help manage this aggregation process at scale while ensuring reliability and performance.