Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
223 views
in Technique[技术] by (71.8m points)

sse - efficient way to convert scatter indices into gather indices?

I'm trying to write a stream compaction (take an array and get rid of empty elements) with SIMD intrinsics. Each iteration of the loop processes 8 elements at a time (SIMD width).

With SSE intrinsics, I can do this fairly efficiently with _mm_shuffle_epi8(), which does a 16 entry table lookup (gather in parallel computing terminology). The shuffle indices are precomputed, and looked up with a bit mask.

for (i = 0; i < n; i += 8)
{
  v8n_Data = _mm_load_si128(&data[i]);
  mask = _mm_movemask_epi8(&is_valid[i]) & 0xff;     // is_valid is byte array
  v8n_Compacted = _mm_shuffle_epi8(v16n_ShuffleIndices[mask]);
  _mm_storeu_si128(&compacted[count], v8n_Compacted);

  count += bitCount[mask];
}

My problem is now I would like to implement this for Altivec SIMD too (don't ask why - misguided business decision). Altivec doesn't have an equivalent for _mm_movemask_epi8(), a critical ingredient. So, I will need to find a way to either

  1. emulate _mm_movemask_epi8() - seems expensive, several shifts and ORs

  2. directly generate the shuffle indices efficiently -

namely, index i will be the index of the ith valid element in the uncompacted data

element_valid:   0 0 1 0 1 0 0 1 0
gather_indices:  x x x x x x 6 4 1
scatter_indices: 3 3 2 2 1 1 1 0 0

It's simple to do this serially, but I need it to be parallel (SIMD). It seems easy to generate scatter indices with a prefix sum, but since neither AltiVec nor SSE has a scatter instruction, I need gather indices instead. Gather indices are the inverse function of the scatter indices, but how can that be gotten in parallel? I know in the pioneering days of GPU programming, converting scatters to gathers was a common technique, but none of those 2 described methods seem practical.

Maybe if not insisting the compaction preserves the element order will allow more efficient implementation? I can give that up.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

If you want to emulate _mm_movemask_epi8 and you just need an 8 bit scalar mask from 8 byte elements then you can do something like this using AltiVec:

#include <stdio.h>

int main(void)
{
    const vector unsigned char vShift = { 0, 1, 2, 3, 4, 5, 6, 7, 0, 0, 0, 0, 0, 0, 0, 0 };
                                            // constant shift vector

    vector unsigned char isValid = { 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
                                            // sample input

    vector unsigned char v1 = vec_sl(isValid, vShift);
                                            // shift input values
    vector unsigned int v2 = vec_sum4s(v1, (vector unsigned int)(0));
    vector signed int v3 = vec_sum2s((vector signed int)v2, (vector signed int)(0));
                                            // sum shifted values
    vector signed int v4 = vec_splat(v3, 1);
    unsigned int mask __attribute__ ((aligned(16)));
    vec_ste((vector unsigned int)v4, 0, &mask);
                                            // store sum in scalar

    printf("v1 = %vu
", v1);
    printf("v2 = %#vlx
", v2);
    printf("v3 = %#vlx
", v3);
    printf("v4 = %#vlx
", v4);
    printf("mask = %#x
", mask);

    return 0;
}

This is 5 AltiVec instructions versus 1 in SSE. You might be able to lose the vec_splat and get it down to 4.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...