Your code currently has a race condition, which is why the result is incorrect. To illustrate why, let's use a simple example: you are running on 2 threads and the array is `int input[4] = {1, 2, 3, 4};`. You initialize `sum` to `0` correctly and are ready to start the loop. In the first iteration of the loop, thread 0 and thread 1 both read `sum` from memory as `0`, add their respective element to it, and write the result back to memory. However, this means that thread 0 is trying to write `sum = 1` to memory (the first element is `1`, and `sum = 0 + 1 = 1`), while thread 1 is trying to write `sum = 2` to memory (the second element is `2`, and `sum = 0 + 2 = 2`). The end result depends on which thread finishes last, and therefore writes to memory last, which is a race condition. Not only that, but in this particular case neither of the answers the code could produce is correct! There are several ways to get around this; I'll detail three basic ones below.
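For concreteness, here is a minimal, compilable sketch of the racy version I'm assuming you have. The `input`, `snum`, and `sum` names come from your snippets; the array size and the all-ones data are my own choices so that the lost updates actually show up:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    int snum = 1000000;               /* much larger than the 4-element example, so the race is visible */
    int *input = malloc(snum * sizeof *input);
    int sum = 0;
    int i;

    for (i = 0; i < snum; i++)
        input[i] = 1;                 /* the correct total is therefore snum */

    /* Racy loop: every thread reads and writes the shared sum with no
       synchronization, so some updates are lost. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < snum; i++) {
        int *tmpsum = input + i;
        sum += *tmpsum;
    }

    printf("expected %d, got %d\n", snum, sum);
    free(input);
    return 0;
}
```

Compiled with `gcc -fopenmp` and run with a few threads, this will typically print a total well below `snum`.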
`#pragma omp critical`:

In OpenMP, there is what is called a `critical` directive. It restricts the code it is attached to so that only one thread can execute it at a time. For example, your `for` loop can be written as:
```c
#pragma omp parallel for schedule(static)
for (i = 0; i < snum; i++) {
    int *tmpsum = input + i;
    #pragma omp critical
    sum += *tmpsum;      /* only one thread at a time may execute this update */
}
```
This eliminates the race condition, as only one thread reads and writes `sum` at a time. However, the `critical` directive is very bad for performance: the threads end up executing the update one after another, which will likely kill a large portion (if not all) of the gains you get from using OpenMP in the first place.
`#pragma omp atomic`:

The `atomic` directive is very similar to the `critical` directive. The major difference is that, while `critical` can protect any block of code you would like to run one thread at a time, `atomic` applies only to a single memory update (a simple read, write, or read-modify-write of one variable). Since all we are doing here is reading and updating `sum`, this directive works perfectly:
```c
#pragma omp parallel for schedule(static)
for (i = 0; i < snum; i++) {
    int *tmpsum = input + i;
    #pragma omp atomic
    sum += *tmpsum;      /* the read-add-write of sum is performed atomically */
}
```
The performance of `atomic` is generally significantly better than that of `critical`, since the update can often be done with a single hardware atomic instruction instead of a lock. However, it is still not the best option in your particular case.
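If you want to see the difference on your own machine, a rough timing sketch along these lines will do. Everything here (the array size, the `input_data` name, the all-ones data) is my own setup rather than anything from your code, and it is a crude micro-benchmark, not a rigorous measurement:

```c
#include <stdio.h>
#include <omp.h>

#define N 10000000

int input_data[N];                    /* file scope, so the array doesn't live on the stack */

int main(void) {
    int sum, i;
    double t;

    for (i = 0; i < N; i++)
        input_data[i] = 1;

    /* critical: every update takes a lock */
    sum = 0;
    t = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) {
        #pragma omp critical
        sum += input_data[i];
    }
    printf("critical: sum=%d  time=%.3f s\n", sum, omp_get_wtime() - t);

    /* atomic: every update is a single atomic operation */
    sum = 0;
    t = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) {
        #pragma omp atomic
        sum += input_data[i];
    }
    printf("atomic:   sum=%d  time=%.3f s\n", sum, omp_get_wtime() - t);

    return 0;
}
```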
`reduction`:

The method you should use, and the one that has already been suggested by others, is `reduction`. You can use it by changing the `for` loop to:
```c
#pragma omp parallel for schedule(static) reduction(+:sum)
for (i = 0; i < snum; i++) {
    int *tmpsum = input + i;
    sum += *tmpsum;      /* each thread updates its own private copy of sum */
}
```
The `reduction` clause tells OpenMP that, while the loop is running, each thread should keep track of its own private `sum` variable, and that all of those private sums should be added together at the end of the loop. This is the most efficient method, as your entire loop now runs in parallel; the only overhead is right at the end of the loop, when the per-thread `sum` values need to be added up.
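Under the hood, `reduction(+:sum)` behaves roughly like the hand-rolled version below (a sketch only; `local_sum` is my own name, and the actual implementation is free to combine the partial sums differently):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int input[4] = {1, 2, 3, 4};
    int snum = 4;
    int sum = 0;

    #pragma omp parallel
    {
        int local_sum = 0;            /* private partial sum for this thread */
        int i;

        /* the iterations are split among the threads as before */
        #pragma omp for schedule(static)
        for (i = 0; i < snum; i++)
            local_sum += input[i];

        /* one synchronized add per thread, not one per element */
        #pragma omp atomic
        sum += local_sum;
    }

    printf("sum = %d\n", sum);        /* prints 10 */
    return 0;
}
```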