- Say I have 9 jobs with estimates of how long each will take to complete, and 3 queues that process these jobs; together this constitutes one batch. Note that all the jobs are allocated to a queue beforehand, not at run-time, so with poor scheduling some queues may sit idle while others are still processing a job.
- All these jobs should be distributed across the queues in such a way that the total time to complete all the queues is nearly equal.
Consider:
1. Below, Q2 and Q3 complete in 15 mins and then sit idle while Q1 takes another 10 mins to complete.
Batch1:
+------+----+----+----+
| # | Q1 | Q2 | Q3 |
+------+----+----+----+
| 1 | 5 | 5 | 5 |
| 2 | 15 | 5 | 5 |
| 3 | 5 | 5 | 5 |
+------+----+----+----+
| Time | 25 | 15 | 15 |
+------+----+----+----+
| Total| 25 |
+------+----+----+----+
2. Now, ideally, the jobs should be distributed as…
+------+----+----+----+
| # | Q1 | Q2 | Q3 |
+------+----+----+----+
| 1 | 15 | 5 | 5 |
| 2 | - | 5 | 5 |
| 3 | - | 5 | 5 |
| 4 | - | 5 | 5 |
+------+----+----+----+
| Time | 15 | 20 | 20 |
+------+----+----+----+
| Total| 20 |
+------+----+----+----+
This seems to be a straightforward scheduling problem, but I am dealing with hundreds of such jobs, divided into multiple batches, where each batch starts only once the previous one has completed. So I need some algorithm to decide the schedule. Any ideas? Thanks
Update:
Adding more information based on feedback…
- The jobs in a batch are fixed.
- There are no dependencies between jobs within a batch, but there are dependencies between batches. So we can look at solving the problem for a single batch and the solution will apply across all batches.
- Real-world problem: I am modelling a data-loading (ETL) process that consists of several batches and takes 5-6 hrs end to end. I am looking for ways to reduce the overall time.
This is related to the knapsack problem, except that you have multiple knapsacks with no fixed capacity, so every assignment is feasible. That makes a good solution enormously easier to find.
A simple algorithm is to sort the jobs by size in descending order and then, starting from the largest job, assign each one in turn to the queue with the least work assigned so far (so yes, you’ll need to keep track of how much work from the batch is already assigned to each queue). This greedy approach, commonly called longest-processing-time-first scheduling, is not guaranteed to produce the best solution, but it will produce a pretty good one and it is easy to implement correctly.
Where things get more complex is when you have uncertainty in the costs, dependencies between jobs or where you have high priority tasks coming through and pausing some of the queues occasionally. You didn’t mention any of those, so I’d suggest going with the simple algorithm.
Let’s try assigning the jobs [5,5,15,5,5,5] to two queues to demonstrate.
- Sort the jobs: [15,5,5,5,5,5]
- Assign the jobs in order:
A:[] (size 0), B:[] (size 0)
A:[15] (size 15), B:[] (size 0)
A:[15] (size 15), B:[5] (size 5)
A:[15] (size 15), B:[5,5] (size 10)
A:[15] (size 15), B:[5,5,5] (size 15)
A:[15,5] (size 20), B:[5,5,5] (size 15)
A:[15,5] (size 20), B:[5,5,5,5] (size 20)
That looks balanced.
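For what it's worth, the steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the job estimates are plain numbers; the function name `assign_jobs` and its parameters are made up for this example. It uses a heap to find the least-loaded queue, with ties going to the lower-numbered queue.

```python
import heapq

def assign_jobs(durations, num_queues):
    """Greedy largest-job-first assignment; returns one list of jobs per queue.

    `assign_jobs`, `durations` and `num_queues` are hypothetical names
    chosen for this sketch.
    """
    # Heap entries are (total_assigned_time, queue_index), so the queue with
    # the least work is always popped first; ties go to the lower index.
    heap = [(0, q) for q in range(num_queues)]
    queues = [[] for _ in range(num_queues)]
    for d in sorted(durations, reverse=True):
        load, q = heapq.heappop(heap)
        queues[q].append(d)
        heapq.heappush(heap, (load + d, q))
    return queues

queues = assign_jobs([5, 5, 15, 5, 5, 5], 2)
print(queues)                    # [[15, 5], [5, 5, 5, 5]]
print([sum(q) for q in queues])  # [20, 20]
```

Run on the 9 jobs from the question with 3 queues, the same function gives queue totals of 20, 20 and 15, so the batch finishes in 20 mins, the same as the ideal schedule shown in the question.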
Are the jobs in a batch fixed, for example batchA has the jobs jobA1, jobA2, etc.?
Or do we just have a lot of jobs, and need to decide the batches as well as the queues within each batch?
It can be considered a knapsack problem in which you need to fill three knapsacks so that their total values differ as little as possible.
The three knapsacks are the three scheduling queues.
If you can explain the relationships between the jobs, it would help us all in finding a solution.
For example, Job1 DEPENDS_ON Job2, and so on.