Continuing on from my first question, I am trying to optimize a memory hotspot found via VTune profiling of a 64-bit C program.
In particular, I'd like to find the fastest way to test if a 128-byte block of memory contains all zeros. You may assume any desired memory alignment for the memory block; I used 64-byte alignment.
I am using a PC with an Intel Ivy Bridge Core i7 3770 processor, 32 GB of memory, and the free version of the Microsoft VS2010 C compiler.
My first attempt was:
const char* bytevecM; // 4 GB block of memory, 64-byte aligned
size_t* psz; // size_t is 64 bits
// ...
// "m7 & 0xffffff80" selects the 128 byte block to test for all zeros
psz = (size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (psz[0] == 0 && psz[1] == 0
&& psz[2] == 0 && psz[3] == 0
&& psz[4] == 0 && psz[5] == 0
&& psz[6] == 0 && psz[7] == 0
&& psz[8] == 0 && psz[9] == 0
&& psz[10] == 0 && psz[11] == 0
&& psz[12] == 0 && psz[13] == 0
&& psz[14] == 0 && psz[15] == 0) continue;
// ...
VTune profiling of the corresponding assembly follows:
cmp qword ptr [rax], 0x0 0.171s
jnz 0x14000222 42.426s
cmp qword ptr [rax+0x8], 0x0 0.498s
jnz 0x14000222 0.358s
cmp qword ptr [rax+0x10], 0x0 0.124s
jnz 0x14000222 0.031s
cmp qword ptr [rax+0x18], 0x0 0.171s
jnz 0x14000222 0.031s
cmp qword ptr [rax+0x20], 0x0 0.233s
jnz 0x14000222 0.560s
cmp qword ptr [rax+0x28], 0x0 0.498s
jnz 0x14000222 0.358s
cmp qword ptr [rax+0x30], 0x0 0.140s
jnz 0x14000222
cmp qword ptr [rax+0x38], 0x0 0.124s
jnz 0x14000222
cmp qword ptr [rax+0x40], 0x0 0.156s
jnz 0x14000222 2.550s
cmp qword ptr [rax+0x48], 0x0 0.109s
jnz 0x14000222 0.124s
cmp qword ptr [rax+0x50], 0x0 0.078s
jnz 0x14000222 0.016s
cmp qword ptr [rax+0x58], 0x0 0.078s
jnz 0x14000222 0.062s
cmp qword ptr [rax+0x60], 0x0 0.093s
jnz 0x14000222 0.467s
cmp qword ptr [rax+0x68], 0x0 0.047s
jnz 0x14000222 0.016s
cmp qword ptr [rax+0x70], 0x0 0.109s
jnz 0x14000222 0.047s
cmp qword ptr [rax+0x78], 0x0 0.093s
jnz 0x14000222 0.016s
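Before moving to intrinsics, I also sketched (but have not profiled) a scalar variant that ORs all sixteen qwords together and branches once; it always reads the full 128 bytes instead of short-circuiting on the first nonzero word, so I am not sure it is actually a win (same bytevecM and m7 as above):
const size_t* p = (const size_t*)&bytevecM[(unsigned int)m7 & 0xffffff80];
size_t acc = p[0]  | p[1]  | p[2]  | p[3]
           | p[4]  | p[5]  | p[6]  | p[7]
           | p[8]  | p[9]  | p[10] | p[11]
           | p[12] | p[13] | p[14] | p[15];
if (acc == 0) continue; // one branch per 128-byte block instead of sixteen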
I was able to improve on that via Intel intrinsics:
const char* bytevecM; // 4 GB block of memory
__m128i* psz; // __m128i is 128 bits
__m128i one = _mm_set1_epi32(0xffffffff); // all bits one
// ...
psz = (__m128i*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (_mm_testz_si128(psz[0], one) && _mm_testz_si128(psz[1], one)
&& _mm_testz_si128(psz[2], one) && _mm_testz_si128(psz[3], one)
&& _mm_testz_si128(psz[4], one) && _mm_testz_si128(psz[5], one)
&& _mm_testz_si128(psz[6], one) && _mm_testz_si128(psz[7], one)) continue;
// ...
VTune profiling of the corresponding assembly follows:
movdqa xmm0, xmmword ptr [rax] 0.218s
ptest xmm0, xmm2 35.425s
jnz 0x14000ddd 0.700s
movdqa xmm0, xmmword ptr [rax+0x10] 0.124s
ptest xmm0, xmm2 0.078s
jnz 0x14000ddd 0.218s
movdqa xmm0, xmmword ptr [rax+0x20] 0.155s
ptest xmm0, xmm2 0.498s
jnz 0x14000ddd 0.296s
movdqa xmm0, xmmword ptr [rax+0x30] 0.187s
ptest xmm0, xmm2 0.031s
jnz 0x14000ddd
movdqa xmm0, xmmword ptr [rax+0x40] 0.093s
ptest xmm0, xmm2 2.162s
jnz 0x14000ddd 0.280s
movdqa xmm0, xmmword ptr [rax+0x50] 0.109s
ptest xmm0, xmm2 0.031s
jnz 0x14000ddd 0.124s
movdqa xmm0, xmmword ptr [rax+0x60] 0.109s
ptest xmm0, xmm2 0.404s
jnz 0x14000ddd 0.124s
movdqa xmm0, xmmword ptr [rax+0x70] 0.093s
ptest xmm0, xmm2 0.078s
jnz 0x14000ddd 0.016s
As you can see, there are fewer assembly instructions, and this version also proved to be faster in timing tests.
Since I am quite weak in the area of Intel SSE/AVX instructions, I welcome advice on how they might be better employed to speed up this code.
Though I scoured the hundreds of available intrinsics, I may have missed the ideal ones. In particular, I was unable to effectively employ _mm_cmpeq_epi64(); I looked for a "not equal" version of this intrinsic (which seems better suited to this problem) but came up dry. Though the code below "works":
if (_mm_testz_si128(_mm_andnot_si128(_mm_cmpeq_epi64(psz[7], _mm_andnot_si128(_mm_cmpeq_epi64(psz[6], _mm_andnot_si128(_mm_cmpeq_epi64(psz[5], _mm_andnot_si128(_mm_cmpeq_epi64(psz[4], _mm_andnot_si128(_mm_cmpeq_epi64(psz[3], _mm_andnot_si128(_mm_cmpeq_epi64(psz[2], _mm_andnot_si128(_mm_cmpeq_epi64(psz[1], _mm_andnot_si128(_mm_cmpeq_epi64(psz[0], zero), one)), one)), one)), one)), one)), one)), one)), one), one)) continue;
it is borderline unreadable and (unsurprisingly) proved to be way slower than the two versions given above. I feel sure there must be a more elegant way to employ _mm_cmpeq_epi64() and welcome advice on how that might be achieved.
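A different direction I have sketched, but not yet timed carefully, is to OR the eight 128-bit vectors together and do a single ptest per 128-byte block, again giving up the early-out of the versions above:
__m128i acc = _mm_or_si128(
                  _mm_or_si128(_mm_or_si128(psz[0], psz[1]),
                               _mm_or_si128(psz[2], psz[3])),
                  _mm_or_si128(_mm_or_si128(psz[4], psz[5]),
                               _mm_or_si128(psz[6], psz[7])));
if (_mm_testz_si128(acc, acc)) continue; // nonzero only if all 128 bytes are zero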
In addition to using intrinsics from C, raw Intel assembly language solutions to this problem are also welcome.
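For reference, since the i7 3770 supports AVX (but not AVX2), I imagine a 256-bit variant would look roughly like the sketch below; I have not been able to build and time it yet (and I am not sure my compiler settings even emit AVX code), so treat it as untested:
// Untested AVX sketch (needs <immintrin.h>): four 32-byte tests per 128-byte
// block. The block address is 64-byte aligned, so aligned 32-byte loads are safe.
const __m256i* pymm = (const __m256i*)&bytevecM[(unsigned int)m7 & 0xffffff80];
if (_mm256_testz_si256(pymm[0], pymm[0])
    && _mm256_testz_si256(pymm[1], pymm[1])
    && _mm256_testz_si256(pymm[2], pymm[2])
    && _mm256_testz_si256(pymm[3], pymm[3])) continue;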