As @prl points out, you reversed the operands, putting them in Intel order (see Intel's manual entry for cmpxchg). Any time your inline asm doesn't assemble, you should look at the asm the compiler was feeding to the assembler to see what happened to your template. In your case, simply remove the static inline so the compiler will make a stand-alone definition; then you get (on the Godbolt compiler explorer):
# gcc -S output for the original, with cmpxchg operands backwards
        movl    %edx, %eax
        lock; cmpxchg (%ecx), %ebx    # error on this line from the assembler
        pushfl
        popl    %edx
        and     $0x0040, %edx
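For concreteness, here is a hypothetical reconstruction of the kind of operand-reversed template that produces output like that. This is only a guess at the shape of the mistake (names and constraints are mine, not your actual code):

// hypothetical sketch: Intel operand order inside an AT&T-mode template
static inline int cas_broken(int *ptr, int oldVal, int newVal)
{
    int ret;
    __asm__ __volatile__(
        "lock; cmpxchg (%1), %2\n\t"   // memory operand first = Intel order; the assembler rejects it
        "pushfl\n\t"
        "popl %0\n\t"
        "and $0x0040, %0"              // ZF is bit 6 of EFLAGS
        : "=&r" (ret)
        : "r" (ptr), "r" (newVal), "a" (oldVal)   // %3 pins oldVal to EAX
        : "memory", "cc");
    return ret;
}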
Sometimes that will clue your eye / brain in cases where staring at %3 and %0 didn't, especially after you check the instruction-set reference manual entry for cmpxchg and see that the memory operand is the destination (Intel-syntax first operand, AT&T-syntax last operand). This makes sense because the explicit register operand is only ever a source, while EAX and the memory operand are both read, and then one or the other is written depending on the success of the compare. (And semantically you use cmpxchg as a conditional store to a memory destination.)
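To make the operand order concrete, here's the same instruction in both syntaxes (register choices are arbitrary):

    cmpxchg %ebx, (%ecx)          # AT&T: register source first, memory destination last
    cmpxchg dword ptr [ecx], ebx  # Intel: memory destination first, register source second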
You're discarding the load result from the CAS-failure case. I can't think of any use-cases for cmpxchg where doing a separate load of the atomic value would be incorrect rather than just inefficient, but the usual semantics for a CAS function is that oldval is taken by reference and updated on failure. (At least that's how C++11 std::atomic and C11 stdatomic do it, with bool atomic_compare_exchange_weak( volatile A *obj, C* expected, C desired );.)
(The weak/strong distinction allows better code-gen for CAS retry loops on targets that use LL/SC, where spurious failure is possible due to an interrupt or the location being rewritten with the same value. x86's lock cmpxchg is "strong": it only fails when the compared values actually differ.)
Actually, GCC's legacy __sync builtins provide two separate CAS functions: one that returns the old value, and one that returns a bool. Both take the expected-old and new values by value. So it's not the same API that C++11 uses, but apparently it wasn't so horrible that nobody used it.
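A quick sketch of the two legacy flavors (the wrapper names are mine):

// returns the value *ptr held before the attempt
int cas_val(int *ptr, int expected, int desired) {
    return __sync_val_compare_and_swap(ptr, expected, desired);
}
// returns nonzero iff the swap actually happened
int cas_bool(int *ptr, int expected, int desired) {
    return __sync_bool_compare_and_swap(ptr, expected, desired);
}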
Your overcomplicated code isn't portable to x86-64. From your use of popl, I assume you developed it on x86-32. You don't need pushf/pop to get ZF as an integer; that's what setcc is for. cmpxchg example for 64 bit integer has a 32-bit example that works that way (to show what they want a 64-bit version of). Or even better, use GCC6 flag-output syntax so that using this in a loop can compile to a cmpxchg / jne loop, instead of cmpxchg / setz %al / test %al,%al / jnz.
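For compilers without flag outputs, here's a sketch of the setcc approach (AT&T-only template, same constraints as the flag-output version below):

// pre-GCC6 sketch: materialize ZF with setz instead of pushf/pop
static inline char CAS_setz(int *ptr, int *poldVal, int newVal)
{
    char ret;
    __asm__ __volatile__(
        "lock; cmpxchg %[newval], %[mem]\n\t"
        "setz %[ret]"                    // ret = ZF after the cmpxchg
        : [ret] "=q" (ret), [mem] "+m" (*ptr), "+a" (*poldVal)
        : [newval] "r" (newVal)
        : "memory");
    return ret;
}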
We can fix all of those problems and improve the register allocation as well. (If the first or last instruction of an inline-asm statement is a mov, you're probably using constraints inefficiently.)
Of course, by far the best thing for real usage is C11 stdatomic or a GCC builtin: see https://gcc.gnu.org/wiki/DontUseInlineAsm in cases where the compiler can emit just as good (or better) asm from code it "understands", because inline asm constrains the compiler. It's also difficult to write correctly and efficiently, and to maintain.
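For example, the modern GCC builtin gives the same expected-by-reference semantics with no inline asm at all (a sketch; the builtin's bool result is returned as char to match the asm version below):

static inline char CAS_builtin(int *ptr, int *poldVal, int newVal)
{
    return __atomic_compare_exchange_n(ptr, poldVal, newVal,
                                       0,                  // strong, not weak
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}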
Portable to i386 and x86-64, AT&T or Intel syntax, and works for any integer type of register width or smaller:
// Note: oldVal by reference
static inline char CAS_flagout(int *ptr, int *poldVal, int newVal)
{
    char ret;
    __asm__ __volatile__ (
        " lock; cmpxchg  {%[newval], %[mem] | %[mem], %[newval]}\n"
        : "=@ccz" (ret), [mem] "+m" (*ptr), "+a" (*poldVal)
        : [newval] "r" (newVal)
        : "memory");  // barrier for compiler reordering around this

    return ret;       // ZF result, 1 on success else 0
}
// spinning read-only is much better (with _mm_pause in the retry loop),
// not hammering on the cache line with lock cmpxchg.
// This is over-simplified so the asm is super-simple.
void cas_retry(int *lock) {
    int oldval = 0;
    while (!CAS_flagout(lock, &oldval, 1)) oldval = 0;
}
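A sketch of what that read-only spin with _mm_pause could look like, reusing CAS_flagout from above (function name is mine):

#include <immintrin.h>

void spin_lock_pause(int *lock) {
    int expected = 0;
    while (!CAS_flagout(lock, &expected, 1)) {
        // spin read-only until the lock looks free; don't hammer it with lock cmpxchg
        do {
            _mm_pause();   // spin-wait hint: saves power, helps SMT siblings
        } while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0);
        expected = 0;      // CAS_flagout updated it on failure
    }
}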
The { foo,bar | bar,foo } syntax is ASM dialect alternatives; for x86, it's {AT&T | Intel}. The %[newval] is a named operand constraint; it's another way to keep your operands straight. The "=@ccz" takes the z condition code as the output value, like a setz would.
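With the operands the compiler happens to pick below, the dialect-alternatives template expands like this:

    lock; cmpxchg %ecx, (%edx)             # AT&T output (the gcc default)
    lock; cmpxchg DWORD PTR [edx], ecx     # Intel output (-masm=intel)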
Compiles on Godbolt to this asm for 32-bit x86 with AT&T output:
cas_retry:
        pushl   %ebx
        movl    8(%esp), %edx    # load the pointer arg
        movl    $1, %ecx
        xorl    %ebx, %ebx
.L2:
        movl    %ebx, %eax       # xor %eax,%eax would save a lot of insns
        lock; cmpxchg %ecx, (%edx)
        jne     .L2
        popl    %ebx
        ret
gcc is dumb and stores a 0 in one reg before copying it to eax, instead of re-zeroing eax inside the loop. This is why it needs to save/restore EBX at all. It's the same asm we get from avoiding inline asm, though (from x86 spinlock using cmpxchg):
// also omits _mm_pause and read-only retry, see the linked question
void spin_lock_oversimplified(int *p) {
    while (!__sync_bool_compare_and_swap(p, 0, 1));
}
Someone should teach gcc that Intel CPUs can materialize a 0 more cheaply with xor-zeroing than they can copy it with mov, especially on Sandybridge (which has xor-zeroing elimination but no mov-elimination).
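A hand-written sketch of what that could look like (not compiler output): re-zeroing EAX inside the loop means no zeroed register has to survive across iterations, so no call-preserved EBX is needed at all:

cas_retry:
        movl    4(%esp), %edx       # pointer arg (no push, so it's at 4(%esp))
        movl    $1, %ecx
.L2:
        xorl    %eax, %eax          # re-materialize the expected 0 each iteration
        lock; cmpxchg %ecx, (%edx)
        jne     .L2
        ret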