My use case is that I have a pipeline of independent, stand-alone programs that I want to execute in a certain order on specific pieces of data that are output from previous pipeline stages.
The pipeline is entirely linear and doesn’t do anything in terms of alternate paths through the pipe.
I’m currently using SGE to do this and it works OK, but occasionally a job will overstep its memory bounds and fail, and all jobs that require that output data will fail too. The pipeline needs to be restarted in that case, and it seems that the fault tolerance Akka provides might solve that for me?
There are many ways to provide fault tolerance for your service. It heavily depends on your use case, the tech you already have, and how pervasive your stack’s ecosystem is. It’s up to you to choose what to use and how to combine it in your code. Akka is a huge framework with a lot of functionality, and if you are using it only to provide fault tolerance, it can be overkill, but again, it depends on your use case.
Retrying
The most naive approach is to wrap your job in some retry strategy directly in your code, for example something like Spring Retry.
That would detect when your job has failed and retry it.
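For example, a minimal sketch with Spring Retry’s RetryTemplate might look like the following (the stage1.sh script name and the attempt count are just placeholders for one of your pipeline programs and your own retry budget):

```java
import org.springframework.retry.RetryCallback;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

public class RetryStage {
    public static void main(String[] args) throws Exception {
        RetryTemplate template = new RetryTemplate();
        // give up after 3 attempts; tune this to your job's failure profile
        template.setRetryPolicy(new SimpleRetryPolicy(3));

        template.execute((RetryCallback<Void, Exception>) context -> {
            // launch the stand-alone stage as an external process
            Process p = new ProcessBuilder("bash", "stage1.sh").inheritIO().start();
            if (p.waitFor() != 0) {
                // non-zero exit code = failure; throwing triggers another attempt
                throw new IllegalStateException("stage1.sh exited non-zero");
            }
            return null;
        });
    }
}
```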
Resource Managers
Another, less naive approach would be to use a resource manager like YARN or Mesos, or a scheduling service like AWS Batch (considering you are using SGE, probably not a good fit). All of these might be overkill for your use case, but they have very good capabilities in high-availability environments, so they are worth considering if you are chasing scale.
Monitor from outside
Another approach could be for each sub-job to report a healthcheck to an instance (let’s call it a Monitor) whose sole purpose is to monitor the state of the pipeline. If a job fails some healthchecks, the Monitor is responsible for restarting it, or restarting the whole pipe.
This seems to be the simplest and most logical approach in your use case, especially if you combine it with each job keeping state (or your Monitor keeping it for them), so that if one of them fails, you can restart the pipeline from the last successfully completed job.
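A minimal Monitor sketch, assuming Java 11+ and hypothetical stage scripts and a state file (stage1.sh, stage2.sh, stage3.sh, pipeline.state are all placeholders), could record the last successful stage and resume from there on the next run:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Runs the stages in order, records the last successful stage in a state file,
// and resumes from the stage after it when restarted following a failure.
public class PipelineMonitor {
    private static final Path STATE = Paths.get("pipeline.state");
    private static final List<String> STAGES = List.of("stage1.sh", "stage2.sh", "stage3.sh");

    public static void main(String[] args) throws IOException, InterruptedException {
        int start = lastCompleted() + 1;            // resume after the last recorded success
        for (int i = start; i < STAGES.size(); i++) {
            Process p = new ProcessBuilder("bash", STAGES.get(i)).inheritIO().start();
            if (p.waitFor() != 0) {
                // leave the state file untouched so the next run retries this stage
                throw new IllegalStateException(STAGES.get(i) + " failed; rerun to resume");
            }
            Files.writeString(STATE, Integer.toString(i));  // persist progress
        }
        Files.deleteIfExists(STATE);                // pipeline finished, clear the state
    }

    private static int lastCompleted() throws IOException {
        return Files.exists(STATE) ? Integer.parseInt(Files.readString(STATE).trim()) : -1;
    }
}
```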
If you look at a typical pipeline built on Hadoop over YARN, a common approach is for each job to take its input from a path in HDFS (where you keep your state) and write its output to another path in HDFS, so other jobs can take it as their input. YARN, the resource manager, takes care of job state, progress, failures, etc., and allocates resources to each job. This sounds similar to your SGE case; it’s just that you would probably have to implement some (or a lot) of the monitor + state logic yourself. If you don’t care about state, and are fine restarting the pipeline from scratch, it’s much easier, as you have only two very simple moving parts (see the sketch after this list):
- detect when a job has failed
- restart the whole pipeline
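In that stateless variant the Monitor reduces to a retry loop around the whole pipeline. Again just a sketch; the stage names and the attempt limit are placeholders you would replace with your own:

```java
import java.util.List;

public class RestartWholePipeline {
    private static final List<String> STAGES = List.of("stage1.sh", "stage2.sh", "stage3.sh");
    private static final int MAX_ATTEMPTS = 3;

    public static void main(String[] args) throws Exception {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            if (runAll()) {
                return;                             // every stage succeeded
            }
            System.err.println("Pipeline failed on attempt " + attempt + ", restarting from scratch");
        }
        throw new IllegalStateException("Pipeline failed after " + MAX_ATTEMPTS + " attempts");
    }

    // Run the stages in order; report failure as soon as one exits non-zero.
    private static boolean runAll() throws Exception {
        for (String stage : STAGES) {
            Process p = new ProcessBuilder("bash", stage).inheritIO().start();
            if (p.waitFor() != 0) {
                return false;
            }
        }
        return true;
    }
}
```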