An application App-a that I maintain publishes events to an event bus for another team’s application, App-z. So far, the events have been published right after being generated.
Now, I need to delay events by ~30 minutes before publishing them. This is fairly easy to implement, but my concern is that when App-a shuts down (restart, new version, etc.), which typically happens a couple of times a week, these events will be lost.
What are strategies to deal with this situation? I have a couple of ideas, none of which I like too much:
- Persist the events somewhere, then have a scheduled job check the event dates and publish the ones older than 30 minutes. Cons: so far App-a doesn’t persist anything, so I would need to create & manage a new storage, schema, driver… Plus I would also need a new application to run the scheduled job.
- Have an intermediate service App-k do the delay. App-a would publish the events as it does now; App-k would consume them, wait 30 minutes, publish them on App-a’s behalf, and only at that point acknowledge consumption to the event bus, to avoid having the same problem (losing in-memory events on shutdown) in App-k. Our broker is not Kafka, so acknowledging an event is not as simple as committing an offset, and I’m not even sure holding the ack for 30 minutes is possible.
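For what it's worth, the first idea is essentially the "transactional outbox" pattern, and it needs less machinery than it sounds. A minimal sketch, using SQLite as a stand-in store and a hypothetical `publish` callback for the event bus (table and function names are my own, not from your system):

```python
import json
import sqlite3

DELAY_SECONDS = 30 * 60  # publish events only once they are this old


def init_db(conn):
    # Outbox table: events wait here until they are old enough to publish.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS outbox (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            created_at REAL NOT NULL,          -- Unix timestamp
            published INTEGER NOT NULL DEFAULT 0
        )
    """)


def enqueue(conn, event, now):
    # Called by App-a at the moment the event is generated.
    conn.execute(
        "INSERT INTO outbox (payload, created_at, published) VALUES (?, ?, 0)",
        (json.dumps(event), now),
    )
    conn.commit()


def publish_due(conn, publish, now):
    """Scheduled job: publish every unpublished event older than the delay."""
    cutoff = now - DELAY_SECONDS
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 AND created_at <= ?",
        (cutoff,),
    ).fetchall()
    for event_id, payload in rows:
        publish(json.loads(payload))  # send to the event bus
        conn.execute(                 # mark done only after a successful send
            "UPDATE outbox SET published = 1 WHERE id = ?", (event_id,)
        )
    conn.commit()
    return len(rows)
```

Injecting `now` instead of calling `time.time()` inside the functions keeps this testable; in production the job would run every minute or so with the current time. Note the job can also live inside App-a itself (a background timer), so you may not need a separate application, only the storage.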
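The second idea (App-k holding the ack) can at least be sketched independently of the broker. Whether a 30-minute unacknowledged delivery is tolerated depends entirely on your broker's redelivery/ack-timeout settings, so treat this as logic only; the `publish`/`ack` callbacks and delivery tags below are hypothetical stand-ins for your broker's API:

```python
import heapq


class DelayRelay:
    """Sketch of App-k: hold each consumed event for `delay` seconds and ack
    only after it has been re-published. A crash before publishing leaves the
    event unacked, so the broker redelivers it instead of losing it."""

    def __init__(self, delay, publish, ack):
        self.delay = delay
        self.publish = publish  # re-publish to the bus on App-a's behalf
        self.ack = ack          # acknowledge consumption to the broker
        self._held = []         # min-heap of (due_time, delivery_tag, event)

    def on_message(self, delivery_tag, event, now):
        # Deliberately do NOT ack yet; just schedule the event for later.
        heapq.heappush(self._held, (now + self.delay, delivery_tag, event))

    def tick(self, now):
        """Call periodically; publish and ack everything that is due."""
        while self._held and self._held[0][0] <= now:
            _, tag, event = heapq.heappop(self._held)
            self.publish(event)
            self.ack(tag)  # only now does the broker consider it consumed
```

The catch, as you suspect, is the broker: many brokers will consider a delivery dead and redeliver (or close the channel) long before 30 minutes pass, so this design stands or falls on that one configuration knob.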
Before I get into any complicated and potentially bad mechanism I would like to hear suggestions for this problem.
I understand that you would prefer not to introduce a persistence mechanism to App-a, so this answer will probably not be exactly what you are looking for. But IMHO there are only a couple of ways to solve the problem, and both involve persistence.
In the end, the problem is simply that you need the system to be resilient: it should survive application crashes, random server issues, and downtime, and resume from the point of failure. There is no way to avoid some persistence. By introducing systems like App-k you are merely spreading the failure points around; there is still no guarantee that you will not lose messages on system failure.
Most (all?) event-driven systems persist events as soon as they are raised. This not only keeps a record for after-the-fact analysis but also serves as an efficient way to trigger events again when required. You may want to retrigger for many reasons: your message-processing system may have gone down, you may want to change the business logic and process events again, or you may want to replay the events in a different environment to simulate production.
With this in mind, you can persist in two ways (which you may already be aware of):
- Persist as part of the messaging layer. Brokers like Redis and RabbitMQ can be configured to persist a message as soon as it is submitted to the queue. Once messages are in the queue you never lose them, and you ack/nack to mark them as processed. But you seem to have a legacy/existing broker, so these options may not be feasible.
- Persist the event in a database table with a status column, and update the status as the event cycles through the business process. This is the most reliable way I have encountered; in the past it let me handle extremely complex scenarios (time delays, ack/nack, replaying events in staging/test systems, replaying events from an error state, etc.).
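The second option can be sketched as a small state machine over a table. A minimal version, assuming SQLite and a hypothetical NEW → PUBLISHED / ERROR status cycle (a real system would have more states, e.g. one per business-process step):

```python
import json
import sqlite3


def init_events(conn):
    # Each event carries a status that the business process advances:
    # NEW -> PUBLISHED, or NEW -> ERROR (to be retried/replayed later).
    conn.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'NEW'
        )
    """)


def record(conn, event):
    # Persist the event the moment it is raised.
    conn.execute("INSERT INTO events (payload) VALUES (?)",
                 (json.dumps(event),))
    conn.commit()


def process_pending(conn, publish):
    """Publish NEW events and retry ERROR ones, updating the status column."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE status IN ('NEW', 'ERROR')"
    ).fetchall()
    for event_id, payload in rows:
        try:
            publish(json.loads(payload))
            status = "PUBLISHED"
        except Exception:
            status = "ERROR"  # stays in the table, queryable for replay
        conn.execute("UPDATE events SET status = ? WHERE id = ?",
                     (status, event_id))
    conn.commit()
```

Because every event and its state survive a restart, the time-delay requirement reduces to adding a timestamp column and a `WHERE created_at <= ?` condition to the query, and replay (in staging, or from an error state) is just resetting statuses.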