But will my recent stores be visible to subsequent load instructions too?
This sentence makes little sense. Loads are the only way any thread can see the contents of memory. Not sure why you say "too", since there's nothing else. (Other than DMA reads by non-CPU system devices.)
The definition of a store becoming globally visible is that loads in any other thread will get the data from it. It means that the store has left the CPU's private store-buffer and is part of the coherency domain that includes the data caches of all CPUs. (https://en.wikipedia.org/wiki/Cache_coherence).
CPUs always try to commit stores from their store buffer to the globally visible cache/memory state as quickly as possible. All you can do with barriers is make this thread wait until that happens before doing later operations. That can certainly be necessary in multithreaded programs with streaming stores, and it looks like that's what you're actually asking about. But I think it's important to understand that NT stores do reliably become visible to other threads very quickly even with no synchronization.
A mutex unlock on x86 is sometimes a lock add
, in which case that's a full fence for NT stores already. But if you can't rule out a mutex implementation using a simple store then you need at least sfence
.
Normal x86 stores have release memory-ordering semantics (C++11 std::memory_order_release
). MOVNT streaming stores have relaxed ordering, but mutex / spinlock functions, and compiler support for C++11 std::atomic, basically ignores them. For multi-threaded code, you have to fence them yourself to avoid breaking the synchronization behaviour of mutex / locking library functions, because they only synchronize normal x86 strongly-ordered loads and stores.
Loads in the thread that executed the stores will still always see most recently stored value, even from movnt
stores. You never need fences in a single-threaded program. The cardinal rule of out-of-order execution and memory reordering is that it never breaks the illusion of running in program order within a single thread. Same thing for compile-time reordering: since concurrent read/write access to shared data is C++ Undefined Behaviour, compilers only have to preserve single-threaded behaviour unless you use fences to limit compile-time reordering.
MOVNT + SFENCE is useful in cases like producer-consumer multi-threading, or with normal locking where the unlock of a spinlock is just a release-store.
A producer thread writes a big buffer with streaming stores, then stores "true" (or the address of the buffer, or whatever) into a shared flag variable. (Jeff Preshing calls this a payload + guard variable).
A consumer thread is spinning on that synchronization variable, and starts reading the buffer after seeing it become true.
The producer must use sfence after writing the buffer, but before writing the flag, to make sure all the stores into the buffer are globally visible before the flag. (But remember, NT stores are still always locally visible right away to the current thread.)
(With a locking library function, the flag being stored to is the lock. Other threads trying to acquire the lock are using acquire-loads.)
std::atomic <bool> buffer_ready;
producer() {
for(...) {
_mm256_stream_si256(buffer);
}
_mm_sfence();
buffer_ready.store(true, std::memory_order_release);
}
The asm would be something like
vmovntdqa [buf], ymm0
...
sfence
mov byte [buffer_ready], 1
Without sfence
, some of the movnt
stores could be delayed until after the flag store, violating the release semantics of the normal non-NT store.
If you know what hardware you're running on, and you know the buffer is always large, you might get away with skipping the sfence
if you know the consumer always reads the buffer from front to back (in the same order it was written), so it's probably not possible for the stores to the end of the buffer to still be in-flight in a store buffer in the core of the CPU running the producer thread by the time the consumer thread gets to the end of the buffer.
(in comments)
by "subsequent" I mean happening later in time.
There's no way to make this happen unless you limit when those loads can be executed, by using something that synchronizes the producer thread with the consumer. As worded, you're asking for sfence
to make NT stores globally visible the instant it executes, so that loads on other cores that execute 1 clock cycle after sfence
will see the stores. A sane definition of "subsequent" would be "in the next thread that takes the lock this thread currently holds".
Fences stronger than sfence
work, too:
Any atomic read-modify-write operation on x86 needs a lock
prefix, which is a full memory barrier (like mfence
).
So if you for example increment an atomic counter after your streaming stores, you don't also need sfence
. Unfortunately, in C++ std:atomic
and _mm_sfence()
don't know about each other, and compilers are allowed to optimize atomics following the as-if rule. So it's hard to be sure that a lock
ed RMW instruction will be in exactly the place you need it in the resulting asm.
(Basically, if a certain ordering is possible in the C++ abstract machine, the compiler can emit asm that makes it always happen that way. e.g. fold two successive increments into one +=2
so that no thread can ever observe the counter being an odd number.)
Still, the default mo_seq_cst
prevents a lot of compile-time reordering, and there's not much downside to using it for a read-modify-write operation when you're only targeting x86. sfence
is quite cheap, though, so it's probably not worth the effort trying to avoid it between some streaming stores and an lock
ed operation.
Related: pthreads v. SSE weak memory ordering. The asker of that question thought that unlocking a lock would always do a lock
ed operation, thus making sfence
redundant.
C++ compilers don't try to insert sfence
for you after streaming stores, even when there are std::atomic
operations with ordering stronger than relaxed
. It would be too hard for compilers to reliably get this right without being very conservative (e.g. sfence
at the end of every function with an NT store, in case the caller uses atomics).
The Intel intrinsics predate C11 stdatomic
and C++11 std::atomic
.
The implementation of std::atomic
pretends that weakly-ordered stores didn't exist, so you have to fence them yourself with intrinsics.
This seems like a good design choice, since you only want to use movnt
stores in special cases, because of their cache-evicting behaviour. You don't want the compiler ever inserting sfence
where it wasn't needed, or using movnti
for std::memory_order_relaxed
.