My scenario –
- There’s a small set of producers (fewer than 5), each producing fixed-length messages of ~20 characters. There is only one message type.
- The max overall rate of message production is 3 per second across all producers.
- There could be up to 4000 consumers reading these messages. Each consumer needs to read every message at least once. Message processing is very cheap.
- [Ordering] All messages need to be read in the order in which they are published.
- The producers and consumers run in different AWS ECS clusters in an auto-managed cloud deployment, so either side can be scaled up or down at any time.
- [Failure scenario] When a consumer thread (ECS task) goes down, there are two possibilities: an unrecoverable failure (e.g. any fatal error that gets the ECS task killed), or a recoverable failure (the subscriber thread paused or died, the maximum number of reconnection retries was exhausted, etc., but the ECS task is still healthy).
  - Recoverable failures: either the frequency of this scenario is kept to 0.01% or less and we build a manual SOP to kill the ECS task where the failure happened, or, if the frequency can’t be guaranteed to stay below 0.01%, we need a mechanism that lets the consumer come back online and read all the messages it missed.
- I don’t need to persist any message once it’s consumed by all consumers.
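To make the fan-out, ordering, and catch-up requirements above concrete, here is a minimal sketch rather than a recommended design. It assumes a redis-py style client object (passed in as `client`, so nothing is imported from `redis` here) and the default RESP2 reply shape of `xread`. The names `maxlen_for_window`, `publish`, `read_new`, and the `payload` field are hypothetical. Each consumer tracks its own last-seen entry ID, so every consumer sees every entry in stream order, and a consumer that was offline can resume from the ID it stored as long as the entry has not been trimmed yet.

```python
import math


def maxlen_for_window(rate_per_sec, window_sec):
    """Number of entries to retain so a consumer that was offline for
    up to window_sec seconds can still catch up."""
    return math.ceil(rate_per_sec * window_sec)


def publish(client, stream, payload, maxlen):
    # XADD appends the entry and returns its server-assigned ID.
    # MAXLEN with approximate trimming caps the stream length, since
    # old messages do not need to be kept forever.
    return client.xadd(stream, {"payload": payload},
                       maxlen=maxlen, approximate=True)


def read_new(client, stream, last_id, handle, count=100, block_ms=5000):
    """Read entries with IDs greater than last_id, in stream order,
    pass each one to handle(), and return the new last-seen ID."""
    reply = client.xread({stream: last_id}, count=count, block=block_ms)
    for _name, entries in reply or []:
        for entry_id, fields in entries:
            handle(entry_id, fields)
            last_id = entry_id
    return last_id
```

At 3 messages per second, `maxlen_for_window(3, 600)` gives 1800 entries for a 10-minute catch-up window. A consumer would call `read_new` in a loop and keep the returned ID somewhere that survives a reconnect; starting from `"$"` instead delivers only entries added after the call, which means missed messages are skipped.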
Note:
- Auto-scaling of the consumer applications triggers at a max overall CPU utilization of 60–70%, so we don’t expect the CPU to stay above 70% consistently.
- Any consumer failure to process a message would be identified via alarms and handled manually (since this is expected to be very rare).
Would Redis Streams suit my scenario, given that I can get an ACK from each consumer and use that to decide whether a message was published successfully?
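For reference, the ACK mechanism in Redis Streams is tied to consumer groups, and a group delivers each entry to only one of its members. One way to get "every consumer reads every message, with ACKs" is therefore one group per consumer. The sketch below illustrates that idea under stated assumptions: a redis-py style client passed in as `client`, RESP2 reply shapes, and hypothetical helper names (`ensure_group`, `consume_and_ack`, `groups_with_pending`). Whether per-consumer groups are practical at 4000 consumers is not something this sketch answers.

```python
def ensure_group(client, stream, group):
    # XGROUP CREATE fails with a BUSYGROUP error if the group already
    # exists. A generic Exception is caught here only to avoid importing
    # the client library; real code would catch its ResponseError.
    try:
        client.xgroup_create(stream, group, id="$", mkstream=True)
    except Exception as exc:
        if "BUSYGROUP" not in str(exc):
            raise


def consume_and_ack(client, stream, group, consumer, handle,
                    count=100, block_ms=5000):
    """Read entries not yet delivered to this group, process each one,
    then acknowledge it. Returns the number of entries acknowledged."""
    reply = client.xreadgroup(group, consumer, {stream: ">"},
                              count=count, block=block_ms)
    acked = 0
    for _name, entries in reply or []:
        for entry_id, fields in entries:
            handle(entry_id, fields)
            # XACK removes the entry from the group's pending list.
            acked += client.xack(stream, group, entry_id)
    return acked


def groups_with_pending(client, stream, groups):
    """Groups that still have delivered-but-unacknowledged entries."""
    return [g for g in groups
            if client.xpending(stream, g)["pending"] > 0]
```

Note that `groups_with_pending` only reports entries that were delivered and not yet acknowledged; a consumer that never read an entry at all does not show up as pending, so this is not by itself proof that every consumer has seen a message.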