I have a question about the following task:
"Given a two-dimensional array "a[N][M]", i.e. N rows of length M. Each element of the array contains a random integer value between 0 and 16. Write a kernel "compact(int *a, int *listM, int *listN)" that consists of only one block of N threads, in which each thread counts, for one row of the array, how many elements have the value 16.
The threads write these numbers into an array "num" of length N in shared memory, and then (after a barrier) one of the threads executes the prefix code "PrefixSum(int *num, int N)" listed below (I explain what this code does in a comment in my kernel). Finally (after another barrier), each thread "Idx" writes the N- and M-values, i.e. the positions (or "x- and y-coordinates"), of the elements of its row that have the value 16 into two arrays "listM" and "listN" in global memory, starting at position "num[Idx]" in these arrays. The prefix code mentioned above is there to make this last task easier."
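The assignment's PrefixSum listing is not reproduced in this post, so the following is only a minimal sketch of what such a function might look like, assuming it is a serial, in-place exclusive prefix sum (which matches the example num{2,4,3,1} -> num{0,2,6,9} given in the kernel comments). The HOST_DEVICE macro is an assumption added here so that the same source builds both with nvcc and with a plain C++ compiler.

```cpp
#include <cassert>

// Assumption: lets the same source build as CUDA (nvcc) or as plain C++.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical stand-in for the assignment's PrefixSum.
// Serial exclusive prefix sum, in place: afterwards num[i] holds the sum
// of the ORIGINAL values num[0] .. num[i-1], so num[0] is always 0.
HOST_DEVICE void PrefixSum(int *num, int N)
{
    assert(N >= 0);
    int running = 0;               // sum of all counts seen so far
    for (int i = 0; i < N; i++) {
        int count = num[i];        // save the count before overwriting it
        num[i] = running;          // start offset for row i
        running += count;
    }
}
```

After the call, num[Idx] is the offset at which thread Idx may start writing, and the last offset plus the last row's count gives the total number of 16s.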
I've written a kernel and a suitable main to test it. However, I still have a problem that I cannot solve.
The two arrays "listM" and "listN" should store the position of every 16 that occurs in the array "a[N][M]". Their size must therefore equal the total number of occurrences of 16, which can vary.
Since the exact number of elements with the value 16 is not known in advance, the amount of memory needed for "listM" and "listN" is only known while the kernel is running. Of course, I could simply allocate enough memory for the maximum possible number, namely N times M, at program start, but that would be very inefficient. Is it possible to write the kernel so that each thread dynamically enlarges the two arrays "listM" and "listN" after counting the number of elements with the value 16 in its row (by exactly this number)?
Here is my kernel:
__global__ void compact(int* a, int* listM, int* listN)
{
    int Idx = threadIdx.x;      // one thread per row
    int elements = 0;
    int i;
    __shared__ int num[N];      // N and M must be compile-time constants

    // Count the 16s in this thread's row. "a" is a flat int*, so element
    // (Idx, i) is addressed as a[Idx * M + i] (row-major layout).
    for (i = 0; i < M; i++)
    {
        if (a[Idx * M + i] == 16)
        {
            elements++;
        }
    }
    num[Idx] = elements;
    // At this point the thread knows how many elements of its row have the value 16 and would
    // need to allocate exactly that much extra memory in "listM" and "listN". Is that possible?
    __syncthreads();
    if (Idx == 0)
    {
        // This function replaces each element of "num" with the total number of 16s
        // counted in all preceding rows.
        // Example: Input: num{2,4,3,1} Output: num{0,2,6,9}
        PrefixSum(num, N);
    }
    __syncthreads();
    // The output of PrefixSum(num, N) can now be used for the last task: each thread writes the
    // "coordinates" of every 16 in its row into "listM" and "listN", starting at position num[Idx].
    int pos = num[Idx];         // next free slot for this thread
    for (i = 0; i < M; i++)
    {
        if (a[Idx * M + i] == 16)
        {
            listM[pos] = Idx;   // row index
            listN[pos] = i;     // column index
            pos++;              // advance per hit, not per column
        }
    }
}
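For checking the kernel's output from the test main, a host-side reference can be handy. The following is only a sketch under stated assumptions: the function name compactReference is hypothetical, the input is assumed to be stored as a flat row-major array of N * M ints, and rows are scanned in order so the result has the same layout the kernel is supposed to produce (row index in listM, column index in listN).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of the assignment).
// Scans the flat, row-major N x M array and appends the coordinates of
// every element equal to 16. Returns the total number of 16s found.
int compactReference(const std::vector<int> &a, int N, int M,
                     std::vector<int> &listM, std::vector<int> &listN)
{
    assert(a.size() == static_cast<std::size_t>(N) * M);
    listM.clear();
    listN.clear();
    for (int row = 0; row < N; row++) {
        for (int col = 0; col < M; col++) {
            if (a[row * M + col] == 16) {
                listM.push_back(row);   // same role as "Idx" in the kernel
                listN.push_back(col);   // same role as "i" in the kernel
            }
        }
    }
    return static_cast<int>(listM.size());
}
```

The return value is the exact number of entries the two lists need, which is the quantity the question is about; the first that many entries copied back from the device can be compared against listM and listN element by element.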
Asked by Rabobsel; translated from Stack Overflow.