Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share

assembly - How much faster are SSE4.2 string instructions than SSE2 for memcmp?

Here is my code in assembler.

Can you embed it in C++ and benchmark it against SSE4.2 for speed?

I would very much like to see how much SSE4 improves things, or whether it isn't worth bothering with at all. Let's check (I don't have hardware support beyond SSSE3).

{ sse2 strcmp WideChar 32 bit }
function CmpSee2(const P1, P2: Pointer; len: Integer): Boolean;
asm
    push ebx           // save ebx (callee-saved)
    cmp EAX, EDX      // Str1 = Str2?
    je @@true        // equal pointers: exit true
    test eax, eax   // Str1 = nil?
    je @@false     // exit false
    test edx, edx // Str2 = nil?
    je @@false   // exit false
    sub edx, eax              // edx := Str2 - Str1 (offset between the strings)
    mov ebx, [eax]           // load 4 bytes of Str1
    xor ebx, [eax + edx]    // compare with 4 bytes of Str2
    jnz @@false            // Str1 <> Str2: exit false
    sub ecx, 2            // consumed 2 wide chars (4 bytes)
    { AnsiChar  : sub ecx, 4 }
    jbe @@true           // ecx <= 0: exit true
    lea eax, [eax + 4]  // advance 4 bytes
    @@To1:
    movdqa xmm0, DQWORD PTR [eax]       // load 16 bytes of Str1
    pcmpeqw xmm0, DQWORD PTR [eax+edx] // load 16 bytes of Str2 and compare words
    pmovmskb ebx, xmm0                // byte mask of the compare result
    cmp ebx, 65535                   // all 16 bytes equal?
    jne @@Final                     // ebx <> 65535: go handle the tail
    add eax, 16                    // next 16 bytes
    sub ecx, 8                    // consumed 8 wide chars (16 bytes)
    { AnsiChar  : sub ecx, 16 }
    ja @@To1                     // ecx > 0: keep looping
    @@true:                       // result true
    mov eax, 1                 // set true
    pop ebx                   // restore ebx
    ret                      // return
    @@false:                  // result false
    mov eax, 0             // set false
    pop ebx               // restore ebx
    ret                  // return
    @@Final:
    cmp ecx, 7         // (ebx <> 65535) and (ecx > 7)?
    { AnsiChar : cmp ecx, 15 }
    jae @@false       // mismatch not past the end: exit false
    movzx ecx, word ptr @@mask[ecx * 2 - 2] // ecx := mask[ecx]
    and ebx, ecx                           // ebx := ebx and ecx
    cmp ebx, ecx                          // all remaining bytes equal?
    sete al                              // set result if equal
    pop ebx                             // restore ebx
    ret                                // return
    @@mask: // table of masks (Mersenne numbers: 2^n - 1)
    dw $000F, $003F, $00FF, $03FF, $0FFF, $3FFF
    { AnsiChar
    dw 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383
    }
end;

Sample, 32-bit: https://vk.com/doc297044195_451679410



1 Answer


You called your function strcmp, but what you've actually implemented is an alignment-required memcmp(const void *a, const void *b, size_t words). Both movdqa and pcmpeqw xmm0, [mem] will fault if the pointer isn't 16B-aligned. (Actually, if a+4 isn't 16B-aligned, because you do the first 4 scalar and increment by 4 bytes.)

With the right startup code and movdqu, you could handle arbitrary alignments (reaching an alignment boundary for the pointer you want to use as a memory operand to pcmpeqw). For convenience, you could require that both pointers are wide-char-aligned to start with, but you don't need to (especially since you're just returning true/false, not negative / 0 / positive as a sort order.)
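The same idea can be sketched in C with SSE2 intrinsics (a sketch only, with my own function and variable names, not code from the post): `_mm_loadu_si128` corresponds to `movdqu`, so neither pointer needs any alignment.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical sketch: equality-only memcmp over `len` 16-bit elements,
 * using unaligned loads (movdqu) so neither pointer needs 16B alignment. */
static int wmemeq_sse2(const unsigned short *a, const unsigned short *b, size_t len)
{
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {            /* 8 wide chars = 16 bytes */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i eq = _mm_cmpeq_epi16(va, vb); /* 0xFFFF per equal element */
        if (_mm_movemask_epi8(eq) != 0xFFFF)  /* some element differed */
            return 0;
    }
    for (; i < len; i++)                      /* scalar tail: < 8 elements */
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

A real version would reach an alignment boundary first as described above; this sketch just leans on unaligned loads for simplicity.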


You're asking about performance of SSE2 pcmpeqw vs. pcmpistrm, right? (The explicit-length SSE4.2 instructions like pcmpestrm have worse throughput than the implicit-length versions, so use the implicit-length versions in your main loop when you're not close to the end of the string. See Agner Fog's instruction tables and microarch guide).

For memcmp (or carefully-implemented strcmp), the best you can do with SSE4.2 is slower than the best you can do with SSE2 (or SSSE3) on most CPUs. Maybe useful for very short strings, but not for the main loop of memcmp.

On Nehalem: pcmpistri is 4 uops, 2c throughput (with a memory operand), so with no other loop overhead, it can keep up with memory. (Nehalem only has 1 load port). pcmpestri has 6c throughput: 3x slower.

On Sandybridge through Skylake, pcmpistri xmm0, [eax] has 3c throughput, so it's a factor of 3 too slow to keep up with 1 vector per clock (2 load ports). pcmpestri has 4c throughput on most of those, so it's not as much worse. (Maybe useful for the last partial-vector, but not in the main loop).

On Silvermont/KNL, pcmpistrm is the fastest of the four, and runs at one per 14 cycles, so it's total garbage for simple stuff.

On AMD Jaguar, pcmpistri is 2c throughput, so it might actually be usable (only one load port). pcmpestri is 5c throughput, so it sucks.

On AMD Ryzen, pcmpistri is also 2c throughput, so it's crap there (2 load ports, and front-end throughput of 5 uops per clock (or 6 uops if any (or all?) are from multi-uop instructions), mean you can go faster).

On AMD Bulldozer-family, pcmpistri has 3c throughput until Steamroller, where it's 5c. pcmpestri has 10c throughput. They're micro-coded as 7 or 27 m-ops, so AMD didn't spend a lot of silicon on them.

On most CPUs, they're only worth it if you're taking full advantage of them for stuff you can't do with just pcmpeq/pmovmskb. But if you can use AVX2 or especially AVX512BW, even doing complicated things might be faster with more instructions on wider vectors. (There are no wider versions of the SSE4.2 string instructions.) Maybe the SSE4.2 string instructions are still useful for functions that usually deal with short strings, because wide vector loops usually need more startup / cleanup overhead. Also, in a program that doesn't spend much time in SIMD loops, using AVX or AVX512 in one small function will still reduce your max turbo clock speed for the next millisecond or so, and could easily be a net loss.


A good inner loop should bottleneck on load throughput, or come as close as possible. movdqu / pcmpeqw [one-register] / pmovmskb / macro-fused cmp+jcc is only 4 fused-domain uops, so this is almost achievable on Sandybridge-family CPUs.


See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for an implementation and some benchmarks, but that's for C-style implicit-length strings where you have to check for 0 bytes. It looks like you're using explicit-length strings, so after checking that the lengths are equal, it's just memcmp. (Or I guess if you need to find the sort order instead of just equal / not-equal, you'd have to memcmp out to the end of the shorter string.)
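That reduction can be stated as a one-liner (a sketch with illustrative names, not from the post): once lengths are known, equality is just a length check plus memcmp.

```c
#include <string.h>
#include <stddef.h>

/* For explicit-length (Pascal-style) strings, equality reduces to
 * comparing the lengths and then memcmp over the bytes. */
static int str_equal(const void *a, size_t alen, const void *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}
```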

For strcmp with 8-bit strings, on most CPUs it's faster not to use the SSE4.2 string instructions. See the comments on the strchr.com article for some benchmarks (of that implicit-length string version). glibc for example doesn't use the SSE4.2 string instructions for strcmp, because they're not faster on most CPUs. They might be a win for strstr though.


glibc has several SSE2/SSSE3 asm strcmp and memcmp implementations. (It's LGPLed, so you can't just copy it into non-GPL projects, but have a look at what they do.) Some of the string functions (like strlen) only branch per 64 bytes, and then come back to sort out which byte within the cache line had the hit. But their memcmp implementation just unrolls with movdqu / pcmpeqb. You can use pcmpeqw since you want to know the position of the first 16-bit element that's different, rather than the first byte.
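A sketch of that last point (my names; assuming GCC/Clang for `__builtin_ctz`): pcmpeqw plus pmovmskb lets a bit-scan recover the index of the first differing 16-bit element directly, since each element covers 2 mask bits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sketch: index of the first differing 16-bit element within one 16-byte
 * chunk, via pcmpeqw / pmovmskb / bit-scan.  Returns 8 if all equal. */
static int first_diff_epi16(const unsigned short *a, const unsigned short *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi16(va, vb));
    unsigned diff = ~mask & 0xFFFF;   /* set bits = byte positions that differ */
    if (diff == 0)
        return 8;                     /* all 8 elements equal */
    /* each 16-bit element covers 2 mask bits; ctz = bsf on the inverted mask */
    return __builtin_ctz(diff) / 2;
}
```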


Your SSE2 implementation could be even faster. Put the indexed addressing mode on the movdqa load, since a pure load is a single uop either way, while pcmpeqw xmm0, [eax+edx] would un-laminate on Intel Sandybridge/Ivybridge (it's fine on Nehalem, or Haswell and later); pcmpeqw xmm0, [eax] with a one-register addressing mode stays micro-fused without un-laminating.

You should unroll a couple times to reduce loop overhead. You should combine the pointer-increment with the loop counter so you cmp/jb instead of sub/ja: macro-fusion on more CPUs, and avoids writing a register (reducing the amount of physical registers needed for register-renaming).

Your inner loop, on Intel Sandybridge/Ivybridge, runs like this:

@@To1:
movdqa xmm0, DQWORD PTR [eax]       // 1 uop
pcmpeqw xmm0, DQWORD PTR [eax+edx] // 2 uops on Intel SnB/IvB, 1 on Nehalem and earlier or Haswell and later.
pmovmskb ebx, xmm0                // 1 uop
cmp ebx, 65535
jne @@Final                     // 1 uop  (macro-fused with cmp)
add eax, 16                    // 1 uop
sub ecx, 8
{ AnsiChar  : sub ecx, 16 }
ja @@To1                     // 1 uop (macro-fused with sub on SnB and later, otherwise 2)

This is 7 fused-domain uops, so it can only issue from the front-end at best 7/4 cycles per iteration on mainstream Intel CPUs. This is very far from bottlenecking on 2 loads per clock. On Haswell and later, it's 6/4 cycles per iteration, because indexed addressing modes can stay micro-fused with 2-operand load-modify instructions like pcmpeqw, but not anything else (like pabsw xmm0, [eax+edx] (doesn't read destination) or AVX vpcmpeqw xmm0, xmm0, [eax+edx] (3 operands)). See Micro fusion and addressing modes.
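One way to fold the pointer increment and the loop counter into a single register, sketched in C with intrinsics (names mine): index from the end with a negative byte offset that counts up toward zero, so the loop branch tests the result of the add itself instead of needing a separate sub/ja.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Sketch: one register serves as both pointer offset and loop counter.
 * `n` (wide chars) is assumed to be a multiple of 8 here; unaligned
 * loads are used for simplicity. */
static int wmemeq_counted(const unsigned short *a, const unsigned short *b, size_t n)
{
    const char *pa = (const char *)(a + n);   /* one past the end */
    const char *pb = (const char *)(b + n);
    ptrdiff_t i = -(ptrdiff_t)(n * 2);        /* negative byte offset */
    for (; i != 0; i += 16) {                 /* add / jnz: no separate cmp */
        __m128i va = _mm_loadu_si128((const __m128i *)(pa + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(pb + i));
        if (_mm_movemask_epi8(_mm_cmpeq_epi16(va, vb)) != 0xFFFF)
            return 0;
    }
    return 1;
}
```

With this structure only one register is written per iteration, easing pressure on register-renaming as described above.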


This could be more efficient for small strings with better setup/cleanup, too.

In your pointer-setup code, you could save a cmp if you check for NULL pointers first. You can sub / jne to subtract and check for both equal with the same macro-fused compare and branch. (It will only macro-fuse on Intel Sandybridge-family, and only Haswell can make 2 macro-fusions in a single decode block. But Haswell/Broadwell/Skylake CPUs are common and becoming ever more common, and this has no downside for other CPUs unless equal-pointers is so common that doing that check first matters.)


In your return path: Always use xor eax,eax to zero a register whenever possible, not mov eax, 0.

You don't seem to avoid reading from past the end of the string. You should test your function with strings that end right at the end of a page, where the next page is unmapped.
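A test harness for that can be sketched with POSIX mmap/mprotect (helper name is mine): place the data so it ends exactly at a page boundary, with the following page inaccessible, so any read past the end faults immediately.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>
#include <stddef.h>

/* Sketch (POSIX): copy `len` bytes so they end exactly at a page boundary,
 * with a PROT_NONE guard page right after.  Calling the comparison
 * function on this buffer catches out-of-bounds reads as a crash. */
static void *alloc_end_of_page(const void *src, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    mprotect(p + page, page, PROT_NONE);   /* guard page after the data */
    memcpy(p + page - len, src, len);
    return p + page - len;
}
```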

xor ebx, [eax + edx] has zero advantages over cmp for the early-out scalar test. cmp/jnz can macro-fuse with the jcc, but xor can't.


You load a mask to handle the cleanup to cover the case where you read past the end of the string. You could probably still use the usual bsf to find the first difference in the bitmap. I guess invert it with not to find the first position that didn't compare equal, and check that that's less than the remaining string length.

Or you could generate the mask on the fly with mov eax, -1 and shr, I think. Or for loading it, you can sometimes use a sliding window into a ...,0,0,0,-1,-1,-1,... array, but you need sub-byte offsets so that doesn't work. (It works well for vector masks, if you wanted to mask and redo the pmovmskb. Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all).

Your way isn't bad, as long as it doesn't cache miss. I'd probably go for generating the mask on the fly. Maybe before the loop in another register, because you can mask to get count % 8, so the mask-generation can happen in parallel with the loop.
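The on-the-fly mask can be sketched like this (my names; `rem` is the number of wide chars still to check, 1..16): shifting an all-ones register right, the mov eax, -1 / shr idea, produces the same masks as the table, with 2*rem low bits set.

```c
/* Sketch: generate the pmovmskb-style tail mask on the fly instead of
 * loading it from a table.  Each 16-bit wide char covers 2 mask bits;
 * `rem` must be in 1..16 (a shift by 32 would be undefined). */
static unsigned tail_mask(unsigned rem)
{
    return 0xFFFFFFFFu >> (32 - 2 * rem);   /* 2*rem low bits set */
}

/* Mirror of the asm's `and ebx, ecx / cmp ebx, ecx / sete al`:
 * true when all bytes within the remaining length compared equal. */
static int tail_equal(unsigned eqmask, unsigned rem)
{
    unsigned m = tail_mask(rem);
    return (eqmask & m) == m;
}
```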

