This is probably more of an architecture/tooling question than a code one.
The Issue: I’m trying to figure out the most reasonable way to coordinate a bunch of instances of worker microservices that have dedicated, continuous workloads (that is: they aren’t taking work from a queue; they are keeping a connection open to an external service and listening for notifications from it).
Scenario: I’ve built a service (C# on .NET 6, soon to be .NET 8) that connects to Microsoft Exchange (either on-premises or Exchange Online, depending on the configuration of a given mailbox) in order to monitor a set of mailboxes for incoming emails and take action on them.
This “Monitoring Service” can be watching any number of mailboxes, from zero to hundreds. Currently it’s only handling a dozen or so, which it does by actively polling the mailboxes, but the plan is to move to event notifications from the servers to limit the network traffic and the load on Exchange. The number of different Exchange nodes makes that complicated to do within a single instance, which is part of the reason I want to scale this out horizontally. CPU load and memory footprint are essentially zero except when processing an incoming message, and they quickly return to baseline afterward.
We’re using containers (via Pivotal Cloud Foundry) to host the services, and I would like to be able to spin up multiple instances so that I can handle large numbers of mailboxes, and simplify the application logic so that any given instance only has to deal with one Exchange node at a time.
I know I could write a service to coordinate these Monitoring Services: as an instance spins up, it registers with the coordinating service, which hands it a block of email addresses to manage. When another instance spins up and registers, the coordinating service calls back to the first instance to claw back some of its workload and redistributes that to the new instance. When one of the Monitoring Services becomes non-responsive (it either stops sending a heartbeat or fails to respond to a request from the coordinating service), its workload is redistributed to the remaining responsive instances.
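To make the idea concrete, here is a rough sketch of the coordinating logic I have in mind. It’s in Python rather than C# purely to keep it short, and everything in it is hypothetical: the `Coordinator` class, its method names, the in-memory state, and the 30-second timeout are all made up for illustration. It also skips the hard part, which is the network callback that tells an instance to actually release a mailbox before another one picks it up.

```python
import time
from typing import Callable, Dict, Iterable, List, Set


class Coordinator:
    """Hypothetical in-memory coordinator that splits mailboxes across workers."""

    def __init__(
        self,
        mailboxes: Iterable[str],
        heartbeat_timeout: float = 30.0,  # assumed value, not a recommendation
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._mailboxes: List[str] = sorted(set(mailboxes))
        self._timeout = heartbeat_timeout
        self._clock = clock
        self._last_seen: Dict[str, float] = {}
        self._assignments: Dict[str, Set[str]] = {}

    def register(self, worker_id: str) -> Set[str]:
        """A new instance announces itself and receives its share of mailboxes."""
        self._last_seen[worker_id] = self._clock()
        self._assignments.setdefault(worker_id, set())
        self._rebalance()
        return set(self._assignments[worker_id])

    def heartbeat(self, worker_id: str) -> None:
        """Record that a registered instance is still alive."""
        if worker_id not in self._last_seen:
            raise KeyError(f"unknown worker: {worker_id}")
        self._last_seen[worker_id] = self._clock()

    def evict_stale(self) -> List[str]:
        """Drop instances whose heartbeat is too old and hand their work to the rest."""
        now = self._clock()
        stale = sorted(
            w for w, seen in self._last_seen.items() if now - seen > self._timeout
        )
        for worker_id in stale:
            del self._last_seen[worker_id]
            del self._assignments[worker_id]
        if stale:
            self._rebalance()
        return stale

    def assignments(self) -> Dict[str, Set[str]]:
        """Return a copy of the current worker -> mailboxes mapping."""
        return {w: set(m) for w, m in self._assignments.items()}

    def _rebalance(self) -> None:
        """Even out the load while moving as few mailboxes as possible."""
        workers = sorted(self._assignments)
        if not workers:
            return
        base, extra = divmod(len(self._mailboxes), len(workers))
        quota = {w: base + (1 if i < extra else 0) for i, w in enumerate(workers)}

        # Each worker keeps what it already has, up to its quota (the "claw back").
        kept: Set[str] = set()
        for w in workers:
            keep = sorted(self._assignments[w])[: quota[w]]
            self._assignments[w] = set(keep)
            kept.update(keep)

        # Everything else is handed to workers that are still under quota.
        unassigned = [m for m in self._mailboxes if m not in kept]
        for w in workers:
            while len(self._assignments[w]) < quota[w]:
                self._assignments[w].add(unassigned.pop())
```

In this sketch something would need to call `evict_stale()` on a timer, and each Monitoring Service would need to poll or be notified when its assignment changes. The coordinator itself is also a single point of failure here, which is a big part of why I’d rather not build it myself.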
But this feels like the sort of thing that probably has an out-of-the-box solution. So, before I go and roll my own solution, am I missing something? Is there some better / easier / cleaner solution?