Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
264 views
in Technique[技术] by (71.8m points)

c++ - Genuinely test std::atomic is lock-free or not

Since std::atomic::is_lock_free() may not genuinely reflect the reality [ref], I'm considering writing a genuine runtime test instead. However, when I got down to it, I found that it's not a trivial task I thought it to be. I'm wondering whether there is some clever idea that could do it.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Other than performance, the standard doesn't guarantee any way you can tell; that's more or less the point.

If you are willing to introduce some platform-specific UB, you could do something like cast a atomic<int64_t> * to a volatile int64_t* and see if you observe "tearing" when another thread reads the object. (When to use volatile with multi threading? - normally never, but real hardware has coherent caches between cores that run threads so plain asm load/store are basically like relaxed-atomic.)

If this test succeeds (i.e. the plain C++ type was naturally atomic with just volatile), that tells you any sane compiler will make it lock-free very cheaply. But if it fails, it doesn't tell you very much. A lock-free atomic for that type may be only slightly more expensive than the plain version for loads/stores, or the compiler may not make it lock-free at all. e.g. on 32-bit x86 where lock-free int64_t is efficient with only small overhead (using SSE2 or x87), but volatile int64_t* will produce tearing using two separate 4-byte integer loads or stores the way most compilers compile it.

On any specific platform / target architecture, you can single-step your code in a debugger and see what asm instructions run. (Including stepping into libatomic function calls like __atomic_store_16). This is the only 100% reliable way. (Plus consulting ISA documentation to check atomicity guarantees for different instructions, e.g. whether ARM load/store pair is guaranteed, under what conditions.)

(Fun fact: gcc7 with statically linked libatomic may always use locking for 16-byte objects on x86-64, because it doesn't have an opportunity to do runtime CPU detection at dynamic link time and use lock cmpxchg16b on CPUs that support it, with the same mechanism glibc uses to pick optimal memcpy / strchr implementations for the current system.)


You could portably look for a performance difference (e.g. scalability with multiple readers), but x86-64 lock cmpxchg16b doesn't scale1. Multiple readers contend with each other, unlike 8 byte and narrower atomic objects where pure asm loads are atomic and can be used. lock cmpxchg16b acquires exclusive access to a cache line before executing; abusing the side-effect of atomically loading the old value on failure to implement .load() is much worse than an 8-byte atomic load which compiles to just a regular load instruction.

That's part of the reason that gcc7 decided to stop returning true for is_lock_free() on 16-byte objects, as described in the GCC mailing list message about the change you're asking about.

Also note that clang on 32-bit x86 uses lock cmpxchg8b to implement std::atomic<int64_t>, just like for 16-byte objects in 64-bit mode. So you would see a lack of parallel read scaling with it, too. (https://bugs.llvm.org/show_bug.cgi?id=33109)


std::atomic<> implementations that use locking usually still don't make the object larger by including a lock byte or word in each object. It would change the ABI, but lock-free vs. locking is already an ABI difference. The standard allows this, I think, but weird hardware might need extra bytes in the object even when lock-free. Anyway sizeof(atomic<T>) == sizeof(T) doesn't tell you anything either way. If it's larger it's most likely that your implementation added a mutex, but you can't be sure without checking the asm. (If the size wasn't a power of 2, it could have widened it for alignment.)

(In C11, there's much less scope for including a lock in the object: it has to work even with minimal initialization (e.g. statically to 0), and no destructor. Compilers / ABIs generally want their C stdatomic.h atomics to be compatible with their C++ std::atomic atomics.)

The normal mechanism is to use the address of the atomic object as a key for a global hash table of locks. Two objects aliasing / colliding and sharing the same lock is extra contention, but not a correctness problem. These locks are only taken/released from library functions, not while holding other such locks, so it can't create a deadlock.

You could detect this by using shared memory between two different processes (so each process would have its own hash table of locks). Is C++11 atomic<T> usable with mmap?

  • check that std::atomic<T> is the same size as T (so the lock isn't in the object itself).

  • Map a shared memory segment from two separate processes that don't otherwise share any of their address space. It doesn't matter if you map it to a different base address in each process.

  • Store patterns like all-ones and all-zeros from one process while reading from the other (and look for tearing). Same as what I suggested with volatile above.

  • Also test atomic increment: have each thread do 1G increments and check that the result is 2G every time. Even if pure load and pure store are naturally atomic (the tearing test), read-modify-write operations like fetch_add / operator++ need special support: Can num++ be atomic for 'int num'?

From the C++11 standard, the intent is that this should still be atomic for lock-free objects. It might also work for non-lock-free objects (if they embed the lock in the object), which is why you have to rule that out by checking sizeof().

To facilitate inter-process communication via shared memory, it is our intent that lock-free operations also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation shall not depend on any per-process state.

If you see tearing between two processes, the object wasn't lock-free (at least not the way C++11 intended, and not the way you'd expect on normal shared-memory CPUs.)

I'm not sure why address-free matters if the processes don't have to share any address-space other than 1 page containing the atomic object2. (Of course, C++11 doesn't require that the implementation uses pages at all. Or maybe an implementation could put the hash table of locks at the top or bottom of each page? In which case using a hash function that depended on address bits above the page offset would be totally silly.)

Anyway, this depends on a lot of assumptions about how computers work that are true on all normal CPUs, but which C++ doesn't make. If the implementation you care about is on a mainstream CPU like x86 or ARM under a normal OS, then this testing method should be fairly accurate and might be an alternative to just reading the asm. It's not something that's very practical to do automatically at compile time, but it would be possible to automate a test like this and put it into a build script, unlike reading the asm.


Footnote 1: 16-byte atomics on x86

No x86 hardware documents support for 16-byte atomic load/store with SSE instructions. In practice many modern CPUs do have atomic movaps load/store, but there are no guarantees of this in Intel/AMD manuals the way there are for 8-byte x87/MMX/SSE loads/stores on Pentium and later. And no way to detect which CPUs do/don't have atomic 128-bit ops (other than lock cmpxchg16b), so compiler writers can't safely use them.

See SSE instructions: which CPUs can do atomic 16B memory operations? for a nasty corner case: testing on K10 shows that aligned xmm load/store shows no tearing between threads on the same socket, but threads on different sockets experience rare tearing because HyperTransport apparently only gives the minimum x86 atomicity guarantee of 8 byte objects. (IDK if lock cmpxchg16b is more expensive on a system like that.)

Without published guarantees from vendors, we can never be sure about weird microarchitectural corner cases, either. Lack of tearing in a simple test with one thread writing patterns and the other reading is pretty good evidence, but it's always possible that something could be different in some special case the CPU designers decided to handle a different way than normal.


A pointer + counter struct where read-only access only needs the pointer can be cheap, but current compilers need union hacks to get them to do an 8-byte atomic load of just the first half of the object. How can I implement ABA counter with c++11 CAS?. For an ABA counter, you'd normally update it with a CAS anyway, so lack of a 16-byte atomic pure store is not a problem.

An ILP32 ABI (32-bit pointers) in 64-bit mode (like Linux's x32 ABI, or AArch64's ILP32 ABI) means pointer+integer can fit in only 8 bytes, but integer registers are still 8 bytes wide. This makes it much more efficient to use a pointer+counter atomic object than in full 64-bit mode where a pointer is 8 bytes.


Footnote 2: address-free

I think the term "address-free" is a separate claim from not depending on any per-process state. As I understand it, it means that correctness doesn't depend on both threads using the same address for the same memory location. But if correctness also depends on them sharing the same global hash table (IDK why storing the address of an object in the object itself would ever help), that would only matter if it was possible to have multiple addresses for the same object within the same process. That is possible on something like x86's real-mode segmentation model, where a 20-bit linear address space is addressed with 32-bit segment:offset. (Actual C implementations for 16-bit x86 exposed segmentation to the programmer; hiding it behind C's rules would be possible but not high performance.)

It's also possible with virtual memory: two mappings of the same physical page to different virtual addresses within the same process is possible but weird. That might or might not use the same lock, depending on whether the hash function uses any address bits above the page offset. (The low bits of an address, that represent the offset within a page, are the same for every mapping. i.e. virtual to physical translation for those bits is a no-op, which is why <


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...