I need to write a program that performs LU decomposition, among other things. The problem is that I don't know the preferred way to distribute the loaded matrix from the root process to the other processes. I can write a simple algorithm for specific cases, but I really need a solution that works for an arbitrary number of processes.
I'd prefer a coarse-grain distribution, as described in this presentation (slide 18).
Example
Here I can see that we can use 1 or 2 processes (3 or 4 would be too fine-grain). But what's the correct way with 2 processes? Should the matrix be divided by rows or by columns?
Example 2
In this case, it's even more problematic: I can't distribute the values evenly among 2 processes. Even if I distributed them like this:
P1: 2 8 3 4
P2: 5 1 6 2 5
it would completely break the symmetry of the distribution.
Then, with 3 processes, it's exactly the same situation as in Example 1: rows or columns?
So I presume there is no way to do this for a completely arbitrary number of processes, since the whole point is an even distribution. What's the right approach here?
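To illustrate what I mean by an even distribution, here's how I would currently split n rows over p processes so the counts differ by at most one (just a sketch of my own idea in plain C; the variable names are mine, and I don't know whether rows are even the right unit):

```c
#include <stdio.h>

int main(void) {
    int n = 9, p = 2;  /* e.g. Example 2: 9 elements, 2 processes */
    for (int rank = 0; rank < p; rank++) {
        /* the first n % p ranks get one extra row */
        int count = n / p + (rank < n % p ? 1 : 0);
        int start = rank * (n / p) + (rank < n % p ? rank : n % p);
        printf("P%d: rows %d..%d (%d rows)\n",
               rank + 1, start, start + count - 1, count);
    }
    return 0;
}
```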
I hope I’ve described my problem clearly enough. If not, leave a comment, please, and I’ll improve my question.
Another solution generally used when performing tasks in parallel is the following (a code sketch follows this list):

- You start by giving the first element to the first process, the second element to the second process, …, and the nth element to the nth process.
- Once one of the processes finishes the task for its element, it is given the next unassigned element to deal with.
- Once there are no elements left and all processes have finished, the work is done.
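Here's a minimal sketch of that pattern in C with MPI. The task itself is just an integer index here; `process_task`, `NUM_TASKS` and the message tags are placeholders of my own, not anything specific to LU decomposition:

```c
#include <mpi.h>
#include <stdio.h>

#define NUM_TASKS 20
#define TAG_TASK 1
#define TAG_STOP 2

/* Placeholder for the real work (e.g. one block of the factorization). */
static void process_task(int task) {
    printf("task %d done\n", task);
}

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {            /* master: hands out tasks on demand */
        int next = 0, active = 0;
        MPI_Status st;
        /* seed every worker with one task (or stop it if tasks run out) */
        for (int w = 1; w < size; w++) {
            if (next < NUM_TASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_TASK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        /* whenever a worker reports back, give it the next task or stop it */
        while (active > 0) {
            int done;
            MPI_Recv(&done, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NUM_TASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_TASK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                    /* worker: receive, work, report, repeat */
        MPI_Status st;
        int task;
        for (;;) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            process_task(task);
            MPI_Send(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```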
This makes the number of elements irrelevant, and it specifically targets cases where you can't be sure that all tasks take exactly the same amount of time.
You may have two processes and seventeen tasks and end up giving eleven tasks to the first process and six to the other, either because some of the tasks given to the first process were simpler, or because the first process had more CPU cycles available to it.
The same issue (tasks of differing duration) applies to most situations where parallel computing is used, such as map-reduce, where individual tasks are given to different machines (what if one machine has much more powerful hardware than the others?).
Often, the solution is to have many parallel tasks. If you have two tasks and one takes longer than the other, there is nothing you can do. If you have twenty tasks and one takes longer, you can compensate by giving additional tasks to the processes or machines that aren't dealing with the long-running task.
This also works with tasks which always take the same amount of time, i.e. the situation where a symmetrical distribution would be suitable. Imagine, in your example, that each task takes one second. Then the distribution:
P1: 2 8 3 4
P2: 5 1 6 2 5
means that while the overall run takes five seconds, P1's processing power is wasted for one second, which may be unsuitable (1 wasted second per 9 seconds of useful work, i.e. 11.1% of processing power wasted). On the other hand, if you have nearly twice as many tasks:
P1: 2 8 3 4 1 9 8 8 4
P2: 5 1 6 2 5 6 3 1 4 1
then P1 wastes only one second out of a ten-second run, which is slightly better (1/19 ≈ 5.3% waste). Now imagine you have 1 501 equally fast tasks to split between two processes: the waste drops below 0.1%, and the distribution works pretty well even though it's not symmetrical.
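If you want to check these percentages yourself, here is a tiny C program that computes the wasted fraction for n equal one-second tasks on p processes. The `waste` function and the ceiling-based split are my own formulation of the reasoning above (waste is measured relative to useful work, which is how the 11.1% and 5.3% figures come out):

```c
#include <stdio.h>

/* Fraction of task-seconds wasted when n equal one-second tasks are
 * split as evenly as possible between p processes. */
static double waste(int n, int p) {
    int slowest = (n + p - 1) / p;        /* ceil(n / p): load of the fullest process */
    return (double)(slowest * p - n) / n; /* idle seconds per second of useful work */
}

int main(void) {
    printf("%.1f%%\n", 100.0 * waste(9, 2));    /* 11.1%  -- Example 2 */
    printf("%.1f%%\n", 100.0 * waste(19, 2));   /* 5.3%   -- nearly doubled task count */
    printf("%.3f%%\n", 100.0 * waste(1501, 2)); /* 0.067% -- below 0.1% */
    return 0;
}
```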