I have a Snakemake pipeline that runs a genetic analysis. The genome has been split into many ‘regions’. These regions can run in parallel, so I’ve used expand(), and they all run as expected.
regions = ['r1', 'r2', 'r3', 'r4']
some_pattern = '{region}/file.tsv'
rule all:
    input: expand(some_pattern, region=regions)
and there is a subsequent rule
rule process_region:
    output: some_pattern
    run:
        ...
The issue is that some regions are much more computationally complex and take longer to run, so I would prefer to front-load them so they don’t hold up the pipeline at the end. Is there a way to order the execution of the expanded pattern?
After searching, the short answer is no: there is no way to order jobs within the same rule. However, there is a workaround that has the desired effect.
Create 2 patterns and 2 rules, then use the priority directive to make the rule you need run first.
slow_regions = ['r1', 'r2']
regular_regions = ['r3', 'r4']
slow_pattern = '{slow_region}/slow_file.tsv'
regular_pattern = '{regular_region}/file.tsv'
rule all:
    input:
        expand(slow_pattern, slow_region=slow_regions),
        expand(regular_pattern, regular_region=regular_regions)
and then the subsequent rules. Note the priority directive on the slow rule; you might also want to give it more resources. The default priority is 0, so jobs from a rule with any greater value are preferred by the scheduler.
rule process_slow_region:
    output: slow_pattern
    priority: 1
    threads: xx
    run:
        ...
rule process_regular_region:
    output: regular_pattern
    run:
        ...
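If hand-maintaining the two lists gets tedious, they could be derived from a rough cost estimate per region. The following is only a sketch in plain Python, which can sit at the top of a Snakefile. The `estimated_cost` values, the `split_by_cost` helper, and the threshold are all hypothetical and not part of Snakemake; substitute whatever proxy for runtime you have (region length, timings from a previous run, and so on).

```python
# Hypothetical cost estimates per region (for example, wall time in minutes
# from an earlier run). These numbers are made up for illustration.
estimated_cost = {"r1": 120.0, "r2": 95.0, "r3": 10.0, "r4": 12.0}


def split_by_cost(costs, threshold):
    """Split region names into (slow, regular) lists around a cost threshold.

    Regions are sorted by descending cost only to make the output
    deterministic; the list order itself does not set execution order.
    """
    ordered = sorted(costs, key=costs.get, reverse=True)
    slow = [region for region in ordered if costs[region] >= threshold]
    regular = [region for region in ordered if costs[region] < threshold]
    return slow, regular


# Assumed threshold: anything estimated at 60 or above counts as "slow".
slow_regions, regular_regions = split_by_cost(estimated_cost, threshold=60.0)
```

With the made-up numbers above, `slow_regions` comes out as r1 and r2, matching the hand-written lists. It is still the priority directive on the slow rule that moves those jobs to the front.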