GCC supports the __builtin_clz(int x) builtin, which counts the number of leading zeros (consecutive most-significant zero bits) in the argument.
Among other things[0], this is great for efficiently implementing the lg(unsigned int x) function, which takes the base-2 logarithm of x, rounding down[1]:
/** return the base-2 log of x, where x > 0 */
unsigned lg(unsigned x) {
return 31U - (unsigned)__builtin_clz(x);
}
This works in the straightforward way - in particular, consider the case x == 1 and clz(x) == 31: then x == 2^0, so lg(x) == 0, and 31 - 31 == 0, and we get the correct result. Higher values of x work similarly.
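A quick sanity check of a few values (my own example, not part of the original post), using lg() as defined above:

#include <assert.h>
#include <limits.h>

int main(void) {
    assert(lg(1) == 0);          /* 2^0 */
    assert(lg(32) == 5);         /* exactly 2^5 */
    assert(lg(33) == 5);         /* rounds down to 5, not 6 */
    assert(lg(UINT_MAX) == 31);  /* bit 31 is the highest set bit */
    return 0;
}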
Assuming the builtin is efficiently implemented, this ends up much better than the alternative pure C solutions.
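For reference, an "alternative pure C solution" here means something along these lines - a loop-based sketch of my own, not code from the original - which needs up to 31 shift-and-branch iterations rather than a single instruction:

/* portable fallback: count how many times x can be halved before reaching 1 */
unsigned lg_portable(unsigned x) {
    unsigned r = 0;
    while (x >>= 1)
        r++;
    return r;
}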
Now as it happens, the count-leading-zeros operation is essentially the dual of the bsr instruction on x86, which returns the index of the most-significant 1-bit[2] in the argument. So if there are 10 leading zeros, the first 1-bit is in bit 21 of the argument. In general we have 31 - clz(x) == bsr(x), and so bsr in fact directly implements our desired lg() function, without the superfluous 31U - ... part.
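To make the duality concrete, here is a small sketch (mine, not from the original) that invokes bsr directly via GCC inline assembly - the helper name bsr32 is made up - and checks it against the clz-based expression:

#include <assert.h>

/* x86 only; x must be non-zero, since bsr leaves its destination undefined for 0.
   AT&T operand order: source, then destination. */
static unsigned bsr32(unsigned x) {
    unsigned r;
    __asm__ ("bsr %1, %0" : "=r"(r) : "rm"(x));
    return r;
}

int main(void) {
    for (unsigned x = 1; x != 0; x <<= 1)                  /* 1, 2, 4, ..., 2^31 */
        assert(bsr32(x) == 31U - (unsigned)__builtin_clz(x));
    assert(bsr32(0x00200000U) == 21);                      /* 10 leading zeros -> bit 21 */
    return 0;
}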
In fact, you can read between the lines and see that __builtin_clz was specified with bsr in mind: its result is undefined if the argument is zero, even though the "leading zeros" operation is perfectly well-defined as 32 (or whatever the bit-size of int is) for a zero argument - and zero is exactly the input for which bsr leaves its destination register undefined. So __builtin_clz was certainly designed with the idea of mapping efficiently to a bsr instruction on x86.
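If you do need a defined answer for a zero input (i.e., 32 for a 32-bit int), you have to supply the check yourself; a minimal sketch:

/* like __builtin_clz, but also defined for 0 (assumes 32-bit unsigned) */
unsigned clz32(unsigned x) {
    return x ? (unsigned)__builtin_clz(x) : 32;
}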
However, looking at what GCC actually does at -O3 with otherwise default options, it adds a ton of extra junk:
lg(unsigned int):
bsr edi, edi
mov eax, 31
xor edi, 31
sub eax, edi
ret
The xor edi, 31 is effectively a not edi for the bottom 5 bits that actually matter (it's off-by-one[3] from neg edi), and it turns the result of bsr into clz: since the bsr result fits in 5 bits, edi ^ 31 == 31 - bsr(x) == clz(x). Then the 31 - clz(x) subtraction is carried out - so the generated code computes 31 - (31 - bsr(x)), recovering clz from bsr and then immediately undoing it.
However with -march=haswell, the code simplifies into exactly the expected single bsr instruction:
lg(unsigned int):
bsr eax, edi
ret
Why that is the case is very unclear to me. The bsr instruction has been around for a couple of decades before Haswell, and its behavior is, AFAIK, unchanged. It's not just an issue of tuning for a particular arch, since bsr plus a bunch of extra instructions isn't going to be faster than a plain bsr on any arch, and furthermore using only -mtune=haswell still results in the slower code.
The situation for 64-bit inputs and outputs is even slightly worse: there is an extra movsx in the critical path, which seems to do nothing since the result of clz will never be negative. Again, the -march=haswell variant is optimal, with a single bsr instruction.
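For completeness, the 64-bit variant I mean is the obvious analogue built on __builtin_clzll (a sketch, assuming unsigned long long is 64 bits):

/** return the base-2 log of x, where x > 0 */
unsigned lg64(unsigned long long x) {
    return 63U - (unsigned)__builtin_clzll(x);
}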
Finally, let's check the big players in the non-Windows compiler space, icc and clang. icc just does a bad job and adds redundant stuff like neg eax; add eax, 31; neg eax; add eax, 31; - wtf? clang does a good job regardless of -march.
[0] Such as scanning a bitmap for the first set bit.
[1] The logarithm of 0 is undefined, so calling our function with a 0 argument is undefined behavior.
[2] Here, the LSB is the 0th bit and the MSB is the 31st.
[3] Recall that -x == ~x + 1 in two's complement.