By default, nested parallelism is disabled. Nonetheless, you can explicitly enable it, either by calling:

```c
omp_set_nested(1);
```

or by setting the `OMP_NESTED` environment variable to `true`.
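As a quick sanity check, here is a minimal sketch (my own, not from the question) that enables nesting programmatically and queries the setting back with `omp_get_nested`:

```c
#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_nested(1);  // programmatic equivalent of OMP_NESTED=true
    // omp_get_nested() returns nonzero when nested parallelism is enabled
    printf("Nested parallelism enabled: %d\n", omp_get_nested());
    return 0;
}
```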
From the OpenMP standard we also know that:
> When a thread encounters a parallel construct, a team of threads is created to execute the parallel region. The thread that encountered the parallel construct becomes the master thread of the new team, with a thread number of zero for the duration of the new parallel region. All threads in the new team, including the master thread, execute the region. Once the team is created, the number of threads in the team remains constant for the duration of that parallel region.
From the same source one can also read the following:
> OpenMP parallel regions can be nested inside each other. If nested parallelism is disabled, then the new team created by a thread encountering a parallel construct inside a parallel region consists only of the encountering thread. If nested parallelism is enabled, then the new team may consist of more than one thread.
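The following sketch (my own illustration, not the question's code) makes both quoted rules visible: with nesting disabled, every thread of the outer team that hits the inner `parallel` construct becomes the master, and only member, of its own new team:

```c
#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(2);
    #pragma omp parallel             // outer team: 2 threads
    {
        int outer_id = omp_get_thread_num();
        #pragma omp parallel         // nesting disabled: each inner team has size 1
        {
            printf("outer thread %d -> inner team of %d, inner ID %d\n",
                   outer_id, omp_get_num_threads(), omp_get_thread_num());
        }
    }
    return 0;
}
```

With nesting disabled, this prints two lines, both reporting an inner team size of 1 and an inner ID of 0.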
This explains why, when you add the second parallel region, there is only one thread per team executing the enclosing code (i.e., the for loop). In other words: from the first parallel region, 4 threads are created, and each of those threads, upon encountering the second parallel region, creates a new team and becomes the master of that team (i.e., has ID 0 within the newly created team). However, because you did not explicitly enable nested parallelism, each of those teams is composed of a single thread. Hence, 4 teams with one thread each will execute the for loop. Consequently, the statement:

```c
printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
```

is printed 6 x 4 = 24 times (i.e., the total number of loop iterations multiplied by the total number of threads across the 4 teams). The image below provides a visualization of that flow.
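As a quick empirical check of that count, here is a sketch where the atomic counter is my addition, not part of the question's code:

```c
#include <stdio.h>
#include <omp.h>

int main() {
    int count = 0;                   // shared counter, updated atomically
    omp_set_num_threads(4);
    #pragma omp parallel             // 4 outer threads
    {
        #pragma omp parallel for     // nesting disabled: 4 teams of 1 thread each
        for (int i = 0; i < 6; i++)
        {
            #pragma omp atomic
            count++;                 // one increment per executed loop body
        }
    }
    printf("loop bodies executed: %d\n", count); // prints 24 with nesting disabled
    return 0;
}
```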
If you add a `printf` statement between the first and the second parallel region, as follows:
```c
#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        printf("Before nested parallel region: I am Thread{%d}\n", omp_get_thread_num());
        #pragma omp parallel for // Adding "parallel" is the cause of the problem, but I don't know how to explain it.
        for (int i = 0; i < 6; i++)
        {
            printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
        }
    }
    return 0;
}
```
you would get something similar to the following output (bear in mind that the order in which the first 4 lines are printed is nondeterministic):
```
Before nested parallel region: I am Thread{1}
Before nested parallel region: I am Thread{0}
Before nested parallel region: I am Thread{2}
Before nested parallel region: I am Thread{3}
i = 0, I am Thread 0
i = 0, I am Thread 0
i = 0, I am Thread 0
(...)
i = 5, I am Thread 0
```
This means that within the first parallel region (but still outside the second parallel region) there is a single team of 4 threads, with IDs ranging from 0 to 3, executing in parallel. Hence, each of those threads executes the statement:

```c
printf("Before nested parallel region: I am Thread{%d}\n", omp_get_thread_num());
```

and displays a different value for the `omp_get_thread_num()` call.
As previously mentioned, nested parallelism is disabled. Thus, when each of those threads encounters the second parallel region, it creates a new team and becomes the master (i.e., has ID 0 within the newly created team) and the only member of that team. Hence, the statement

```c
printf("i = %d, I am Thread %d\n", i, omp_get_thread_num());
```

inside the loop always outputs `(...) I am Thread 0`, since the call to `omp_get_thread_num()` in this context always returns 0. However, even though `omp_get_thread_num()` returns 0, that does not imply that the code is being executed sequentially (by the thread with ID 0), but rather that the master of each of the 4 teams is reporting its own ID of 0.
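One way to see that four distinct masters are at work, rather than a single thread, is to also query `omp_get_ancestor_thread_num` (available since OpenMP 3.0), which returns the thread's ID at an enclosing nesting level. A sketch based on the question's loop:

```c
#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        #pragma omp parallel for
        for (int i = 0; i < 6; i++)
        {
            // omp_get_thread_num() is 0 for every inner-team master, but
            // omp_get_ancestor_thread_num(1) reveals which outer thread spawned this team
            printf("i = %d, inner ID %d, outer ID %d\n",
                   i, omp_get_thread_num(), omp_get_ancestor_thread_num(1));
        }
    }
    return 0;
}
```

Every line shows an inner ID of 0, while the outer IDs range from 0 to 3, confirming that four single-thread teams each ran the full loop.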
If you enable nested parallelism, you will have a flow like the one shown in the image below. The execution of threads 1 to 3 was omitted for simplicity's sake; nonetheless, it would have been the same as that of thread 0.
So, from the first parallel region, a team of 4 threads is created. Upon encountering the next parallel region, each thread from the previous team creates a new team of 4 threads, so at that moment there are a total of 16 threads across 4 teams. Finally, each team executes the entire for loop. However, because you have a `#pragma omp parallel for` construct, the iterations of the for loop are divided among the threads within each team. Bear in mind that in the image above I am assuming a certain static distribution of the loop iterations among the threads; I am not implying that the loop iterations will always be divided like this across all implementations of the OpenMP standard.
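For completeness, here is a sketch of the question's program with nesting enabled. The `omp_set_nested(1)` call and the `schedule(static)` clause are my additions; `schedule(static)` merely pins down one concrete iteration split, similar to the one assumed in the image:

```c
#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_nested(1);                // enable nested parallelism
    omp_set_num_threads(4);
    #pragma omp parallel              // outer team: 4 threads
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel for schedule(static) // each inner team of 4 splits the 6 iterations
        for (int i = 0; i < 6; i++)
        {
            printf("outer %d, inner %d, i = %d\n",
                   outer, omp_get_thread_num(), i);
        }
    }
    return 0;
}
```

This still prints 24 lines in total, but now each of the 4 inner teams has 4 threads sharing the 6 iterations, instead of a single thread running them all.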