You should not parallelize the Insertion Sort algorithm
in this manner. From the inner loop condition A[k-1]> key;
one can see that this algorithm assumes that for a given key
in the position k
of the array, if the actual key is bigger than the keys stored on the previous position of the array the swapping of elements stops. Hence, this algorithm assumes that the keys on the positions before k
are already sorted.
When you introduce parallelism, for instance, with two threads, threads 0
and 1
start from the beginning and middle of the array, respectively. However, the first half of the array is not sorted, according to the (wrong) assumption made by the algorithm, which consequently, leads to issues.
Let me illustrate the problem, let us sort the array = [-1,2,-3,4,-5,6,-7,8]
with 2 threads:
Let us fix a given execution ordered (in reality it is non-deterministic)
-
- Thread 0 takes k = 1 and key = 2; array status
[-1,2,-3,4,-5,6,-7,8]
-
- Thread 1 takes k = 5 and key = 6; array status
[-1,2,-3,4,-5,6,-7,8]
-
- Thread 0 takes k = 2 and key = -3; array status
[-3,-1,2,4,-5,6,-7,8]
-
- Thread 1 takes k = 6 and key = -7; array status
[-7,-3,-1,2,4,-5,6,8]
-
- Thread 0 takes k = 3 and key = 2; array status
[-7,-3,-1,2,4,-5,6,8]
-
- Thread 1 takes k = 7 and key = 8; array status
[-7,-3,-1,2,4,-5,6,8]
-
- Thread 0 takes k = 4 and key = 4; array status
[-7,-3,-1,2,4,-5,6,8]
Final result : [-7,-3,-1,2,4,-5,6,8]
On line 4 thread 1
takes the key -7
from position 6
and puts it in the end of the array sifting all its elements from positions 1 to 6
(included) one position to the right, so now -5
is on the old position of -7
. Since, the old position of -7
(6) will never be compared again -5
will stay there untouchable. Therefore, making the array unsorted.
An easy, albeit poor solution, would be to add the OpenMP ordered
clause to the parallel for
construct. However, using this constructor would basically sequentialize your code.
Another possible solution, although I am not 100% sure it fits your needs, would be to make your algorithm parallel by Regular Sampling. You can see here an example of this latter technique apply on quicksort
.
The struct of your algorithm is not the most suitable to be parallelize directly and achieve a good speedups. Since each iteration of the inner loop is interdependent, this will required the use of methods to unsure mutual exclusion, thus resulting in synchronization overhead. You have far better sorting algorithm that you can directly parallelize, typically the ones that use a divide-and-conquer strategy, for instance Radix Sort or Quick Sort, among others.