I have seen multiple examples of systems that are organized essentially as a pipeline: each stage produces intermediate files which are then consumed by the next stage. These stages take a long time to run and are built from custom code, shell scripts, and everything else imaginable (as opposed to everything being neatly written in one programming language X).
Question: how does one develop / organize such a thing?
It feels like we could be using a build system, but in our case it’s the code of the stages that is changing, not the actual input, so it would be nice if I could choose what to rerun. For example, if the project is about mirroring a website and extracting its contents, I don’t want to re-run wget over everything just because I tweaked the later processing code… though re-downloading the entire thing is still something we might want to do deliberately at some point. Build systems usually do not give that level of manual control. (They are also rarely capable of tracking which outputs depend on which units of code specifically.)
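To make the kind of control I mean concrete, here is a minimal sketch of what I would want to be able to express (the stage commands, file names, and URL are made up): the expensive download is skipped unless I deliberately remove its stamp file, while the cheap processing stage can be rerun freely whenever its code changes.

```sh
#!/bin/sh
# Hypothetical two-stage pipeline: mirror a site, then extract its contents.
set -e

# Stage 1: expensive download, guarded by a stamp file. It only reruns if the
# stamp is removed by hand, i.e. when a fresh mirror is judged worth the cost.
if [ ! -f mirror/.done ]; then
    wget --mirror --directory-prefix=mirror https://example.org/
    touch mirror/.done
fi

# Stage 2: cheap processing; safe to rerun every time its code changes.
python3 extract.py mirror/ > extracted.json
```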
The other extreme is a shell script with a single parameter specifying which stage to run. It’s completely manual and reliable, but not especially convenient.
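Concretely, such a script might look like the following bare-bones sketch (the stage names and commands are just placeholders for the real pipeline):

```sh
#!/bin/sh
# run.sh STAGE -- run exactly one pipeline stage, chosen by hand.
# Stage names and commands are placeholders for the real pipeline.
set -e

case "$1" in
    download)
        wget --mirror --directory-prefix=mirror https://example.org/
        ;;
    extract)
        python3 extract.py mirror/ > extracted.json
        ;;
    report)
        python3 render.py extracted.json > report.html
        ;;
    *)
        echo "usage: $0 {download|extract|report}" >&2
        exit 1
        ;;
esac
```

Running `./run.sh extract` reruns just the extraction without touching the mirror, which is exactly the manual control I want, but every dependency between stages lives only in my head.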
I was wondering: is this a class of problem that exists elsewhere, and if so, how do people solve it?