My team runs a set of NodeJS services in a Kubernetes cluster. Recently, we have found that a subset of our operations needs to happen in order, but the order only matters on a resource-by-resource basis.
Resources are identified by a multi-part key (e.g. `state->countyA->itemA`), and resources with divergent paths (e.g. `state->countyA->itemA` and `state->countyA->itemB`) can be run at the same time.

In addition, each part of the key is a resource in its own right: `state->countyA` is just as much of a resource as `state->countyA->itemA`, and the two are able to conflict with each other.

If there is a request for `state->countyA->itemA` already queued/running when we get a request for `state->countyA`, the request for `state->countyA` must wait for the request for `state->countyA->itemA` to finish before it can run.

Similarly, if there is a request for `state->countyA` already queued/running when we get a request for `state->countyA->itemA`, the request for `state->countyA->itemA` must wait for the request for `state->countyA` to finish before it can run, even if there are already other queued/running requests for `state->countyA->itemA` in line before the `state->countyA` request.
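Put another way, two requests conflict exactly when one key is a prefix of the other (including when the keys are equal). A minimal sketch of that rule in TypeScript, with type and function names chosen purely for illustration:

```typescript
// Illustration only: two resource keys conflict when one is a prefix of the
// other (equal keys included), i.e. when neither path diverges from the other.
type ResourceKey = string[]; // e.g. ["state", "countyA", "itemA"]

function conflicts(a: ResourceKey, b: ResourceKey): boolean {
  const shorter = a.length <= b.length ? a : b;
  const longer = a.length <= b.length ? b : a;
  return shorter.every((part, i) => part === longer[i]);
}

// conflicts(["state", "countyA"], ["state", "countyA", "itemA"])          -> true  (must be ordered)
// conflicts(["state", "countyA", "itemA"], ["state", "countyA", "itemB"]) -> false (can run concurrently)
```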
To explain the desired behavior in more transactional/locking terms (a rough sketch of these semantics in code follows the list):

- Each operation requests an exclusive lock on the resource it wishes to operate on (`state->countyA->itemA`) and shared locks on each of the parent resources (`state` and `state->countyA`). All locks are released when the operation finishes.
- An operation requests all necessary locks atomically. This ensures a global ordering for which operations requested locks first.
- Requests for the same lock are granted in FIFO order, waiting if the corresponding lock levels require it.
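For reference, here is a rough sketch of those semantics as a single in-process manager in TypeScript (all names are my own invention, not an existing library): one FIFO queue per resource node, an operation synchronously enqueues shared requests on every ancestor plus an exclusive request on its own key, and it runs once every one of those requests is grantable.

```typescript
type ResourceKey = string[]; // e.g. ["state", "countyA", "itemA"]

interface LockRequest {
  mode: "shared" | "exclusive";
}

class HierarchicalLockManager {
  private queues = new Map<string, LockRequest[]>(); // key path -> FIFO queue
  private waiters: { requests: [string, LockRequest][]; grant: () => void }[] = [];

  /** Resolves once the operation may run; call the returned function to release. */
  acquire(key: ResourceKey): Promise<() => void> {
    // Enqueue shared requests on every ancestor and an exclusive request on
    // the key itself. The loop is synchronous, so the whole set is appended
    // "atomically" relative to other acquire() calls in this process.
    const requests: [string, LockRequest][] = [];
    for (let depth = 1; depth <= key.length; depth++) {
      const path = key.slice(0, depth).join("->");
      const request: LockRequest = {
        mode: depth === key.length ? "exclusive" : "shared",
      };
      if (!this.queues.has(path)) this.queues.set(path, []);
      this.queues.get(path)!.push(request);
      requests.push([path, request]);
    }

    return new Promise<() => void>((resolve) => {
      this.waiters.push({
        requests,
        grant: () => resolve(() => this.release(requests)),
      });
      this.dispatch();
    });
  }

  private release(requests: [string, LockRequest][]): void {
    for (const [path, request] of requests) {
      const queue = this.queues.get(path)!;
      queue.splice(queue.indexOf(request), 1);
      if (queue.length === 0) this.queues.delete(path);
    }
    this.dispatch(); // other operations may be runnable now
  }

  private grantable(path: string, request: LockRequest): boolean {
    const queue = this.queues.get(path)!;
    const ahead = queue.slice(0, queue.indexOf(request));
    // FIFO rules: an exclusive request must be at the head of its queue; a
    // shared request only has to wait for exclusive requests ahead of it.
    return request.mode === "exclusive"
      ? ahead.length === 0
      : ahead.every((r) => r.mode === "shared");
  }

  private dispatch(): void {
    // Grant every waiting operation whose requests are all grantable. Granted
    // requests stay in their queues (as held locks) until release().
    this.waiters = this.waiters.filter((waiter) => {
      const ready = waiter.requests.every(([path, req]) => this.grantable(path, req));
      if (ready) waiter.grant();
      return !ready;
    });
  }
}
```

An operation would then do something like `const release = await locks.acquire(["state", "countyA", "itemA"])`, perform its work, and call `release()` in a `finally` block.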
This behavior is easy enough to achieve when running a single service to manage the queue tree and perform dispatches. However, doing so suffers from many of the issues that come with servers designed to run as one, and only one, instance (single point of failure, limited scalability, problems if you ever need to replace the running instance, etc.).
With all of that said, my question is as follows:
Are there any good, robust ways of achieving the desired behavior in a Kubernetes cluster without having any instances that are “pets” (instances you need to be very careful about when/if/how you tear them down and throw them away)? Preferably using our existing tech stack (Kubernetes, NodeJS, Microsoft SQL Server) rather than anything completely new, but we are willing to explore other options.
If there are, please give details.
I have looked into distributed FIFO queues and not managed to find anything online beyond Amazon SQS.
I have also looked into various options for distributed locking, including Kubernetes leases, as that would give me a chance of implementing it myself without reinventing things from a very low level. However, while I have found a few options that talk about distributed locking, I haven't found anything that seems to offer shared locks, requesting multiple locks atomically, or granting requests in FIFO order.
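To make that last requirement concrete, the shape of the API I am looking for from a distributed lock service is roughly the following (purely illustrative names; this is not an existing library):

```typescript
// Sketch of the lock-service interface I need: shared and exclusive modes,
// atomic acquisition of several locks in one call, and FIFO granting per lock.
interface DistributedLockService {
  /**
   * Atomically enqueues every request (establishing a single global ordering
   * point), resolves once all of them are granted, and returns a handle that
   * releases all of them at once.
   */
  acquireAll(requests: LockRequestSpec[]): Promise<LockHandle>;
}

interface LockRequestSpec {
  resource: string;             // e.g. "state->countyA->itemA"
  mode: "shared" | "exclusive"; // shared for ancestors, exclusive for the target
}

interface LockHandle {
  release(): Promise<void>;
}
```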