GCC supports the __builtin_clz(int x) builtin, which counts the number of leading zeros (consecutive most-significant zero bits) in the argument.
Among other things[0], this is great for efficiently implementing the lg(unsigned int x) function, which takes the base-2 logarithm of x, rounding down[1]:
/** return the base-2 log of x, where x > 0 */
unsigned lg(unsigned x) {
return 31U - (unsigned)__builtin_clz(x);
}
This works in the straightforward way - in particular, consider the case x == 1 and clz(x) == 31: then x == 2^0, so lg(x) == 0, and 31 - 31 == 0, and we get the correct result. Higher values of x work similarly.
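A quick sanity check of a few values (my own example, not part of the original post), using lg() as defined above:

#include <assert.h>
#include <limits.h>

int main(void) {
    assert(lg(1) == 0);          /* 2^0 */
    assert(lg(32) == 5);         /* exactly 2^5 */
    assert(lg(33) == 5);         /* rounds down to 5, not 6 */
    assert(lg(UINT_MAX) == 31);  /* bit 31 is the highest set bit */
    return 0;
}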
Assuming the builtin is efficiently implemented, this ends up much better than the alternative pure C solutions.
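For reference, an "alternative pure C solution" here means something along these lines - a loop-based sketch of my own, not code from the original - which needs up to 31 shift-and-branch iterations rather than a single instruction:

/* portable fallback: count how many times x can be halved before reaching 1 */
unsigned lg_portable(unsigned x) {
    unsigned r = 0;
    while (x >>= 1)
        r++;
    return r;
}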
Now as it happens, the count-leading-zeros operation is essentially the dual of the bsr instruction on x86, which returns the index of the most-significant 1-bit[2] in the argument. So if there are 10 leading zeros, the first 1-bit is in bit 21 of the argument. In general we have 31 - clz(x) == bsr(x), and so bsr in fact directly implements our desired lg() function, without the superfluous 31U - ... part.
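To make the duality concrete, here is a small sketch (mine, not from the original) that invokes bsr directly via GCC inline assembly - the helper name bsr32 is made up - and checks it against the clz-based expression:

#include <assert.h>

/* x86 only; x must be non-zero, since bsr leaves its destination undefined for 0.
   AT&T operand order: source, then destination. */
static unsigned bsr32(unsigned x) {
    unsigned r;
    __asm__ ("bsr %1, %0" : "=r"(r) : "rm"(x));
    return r;
}

int main(void) {
    for (unsigned x = 1; x != 0; x <<= 1)                  /* 1, 2, 4, ..., 2^31 */
        assert(bsr32(x) == 31U - (unsigned)__builtin_clz(x));
    assert(bsr32(0x00200000U) == 21);                      /* 10 leading zeros -> bit 21 */
    return 0;
}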
In fact, you can read between the lines and see that __builtin_clz was specified with bsr in mind: its result is undefined if the argument is zero, even though the "leading zeros" operation is perfectly well-defined as 32 (or whatever the bit-size of int is) for a zero argument - and zero is exactly the input for which bsr leaves its destination register undefined. So __builtin_clz was certainly designed with the idea of mapping efficiently to a bsr instruction on x86.
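If you do need a defined answer for a zero input (i.e., 32 for a 32-bit int), you have to supply the check yourself; a minimal sketch:

/* like __builtin_clz, but also defined for 0 (assumes 32-bit unsigned) */
unsigned clz32(unsigned x) {
    return x ? (unsigned)__builtin_clz(x) : 32;
}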
However, looking at what GCC actually does at -O3 with otherwise default options, it adds a ton of extra junk:
lg(unsigned int):
bsr edi, edi
mov eax, 31
xor edi, 31
sub eax, edi
ret
The xor edi, 31 is effectively a not edi for the bottom 5 bits that actually matter (it's off-by-one[3] from neg edi), and it turns the result of bsr into clz: since the bsr result fits in 5 bits, edi ^ 31 == 31 - bsr(x) == clz(x). Then the 31 - clz(x) subtraction is carried out - so the generated code computes 31 - (31 - bsr(x)), recovering clz from bsr and then immediately undoing it.
However with -march=haswell, the code simplifies into exactly the expected single bsr instruction:
lg(unsigned int):
bsr eax, edi
ret
Why that is the case is very unclear to me. The bsr instruction has been around for a couple of decades before Haswell, and its behavior is, AFAIK, unchanged. It's not just an issue of tuning for a particular arch, since bsr plus a bunch of extra instructions isn't going to be faster than a plain bsr on any arch, and furthermore using only -mtune=haswell still results in the slower code.
The situation for 64-bit inputs and outputs is even slightly worse: there is an extra movsx in the critical path, which seems to do nothing since the result of clz will never be negative. Again, the -march=haswell variant is optimal, with a single bsr instruction.
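For completeness, the 64-bit variant I mean is the obvious analogue built on __builtin_clzll (a sketch, assuming unsigned long long is 64 bits):

/** return the base-2 log of x, where x > 0 */
unsigned lg64(unsigned long long x) {
    return 63U - (unsigned)__builtin_clzll(x);
}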
Finally, let's check the big players in the non-Windows compiler space, icc and clang. icc just does a bad job and adds redundant stuff like neg eax; add eax, 31; neg eax; add eax, 31; - wtf? clang does a good job regardless of -march.
[0] Such as scanning a bitmap for the first set bit.
[1] The logarithm of 0 is undefined, so calling our function with a 0 argument is undefined behavior.
[2] Here, the LSB is the 0th bit and the MSB is the 31st.
[3] Recall that -x == ~x + 1 in two's complement.