performance - Is it useful to use VZEROUPPER if your program+libraries contain no SSE instructions?

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:50:15+0000

You're correct that if your whole program doesn't use any non-VEX instructions that write xmm registers, you don't need vzeroupper to avoid state-transition penalties.

Beware that non-VEX instructions can lurk in CRT startup code and/or the dynamic linker, or other highly non-obvious places.

That said, a non-VEX instruction can only cause a one-time penalty when it runs. The reverse isn't true: one VEX-256 instruction can make non-VEX instructions in general (or just with that register) slow for the rest of the program.
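To make the transition point concrete, here is a minimal sketch of an AVX routine that cleans the upper halves before returning to code that might contain non-VEX SSE instructions. The function names `sum_avx` and `sum_floats` are hypothetical and not from the answer, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, so this assumes one of those compilers on x86. As far as I know, GCC and Clang normally insert vzeroupper on their own at such function boundaries, so the explicit call is mostly there to show where it belongs.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper (not from the answer): sum n floats with 256-bit AVX. */
__attribute__((target("avx")))
static float sum_avx(const float *p, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));

    /* Horizontal reduction: 8 lanes -> 4 -> 2 -> 1. */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    float total = _mm_cvtss_f32(s);

    /* Scalar tail for the last n % 8 elements. */
    for (; i < n; i++)
        total += p[i];

    /* Zero the upper halves of ymm0..15 so that later non-VEX SSE code
       (for example in a library we don't control) sees a clean state. */
    _mm256_zeroupper();
    return total;
}

/* Runtime dispatch: only take the AVX path if the CPU and OS support it. */
float sum_floats(const float *p, size_t n)
{
    assert(p != NULL || n == 0);
    if (__builtin_cpu_supports("avx"))
        return sum_avx(p, n);

    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}
```

A caller would just use `sum_floats(data, count)`. If the whole program really is VEX-only, the `_mm256_zeroupper()` call is unnecessary for avoiding transition penalties, which is the point of the question.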

There's no penalty when mixing VEX and EVEX, so no need to use vzeroupper there.

On Skylake-AVX512, vzeroupper and vzeroall are the only ways to restore max turbo after dirtying a ZMM register, assuming your program still uses any SSE*, AVX1, or AVX2 instructions on xmm/ymm0..15.

See also "Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask?": merely reading a zmm register doesn't cause this.

Posted by @BeeOnRope in chat:

There is a new, pretty bad effect with AVX-512 instructions on surrounding code: once a 512-bit instruction is executed (except perhaps for instructions that don't write to a zmm register) the core enters an "upper 256 dirty state". In this state, any later scalar FP/SSE/AVX instruction (anything using xmm or ymm regs) will internally be extended to 512 bits. This means the processor will be locked to no higher than the AVX turbo (the so-called "L1 license") until vzeroupper or vzeroall are issued.

Unlike the earlier "dirty upper 128" issue with AVX and legacy non-VEX SSE (which still exists on Skylake Xeon), this will slow down all code due to the lower frequency, but there are no "merging uops" or false dependencies or anything like that: it's just that the smaller operations are effectively treated as 512-bit wide in order to implement the zero-extending behavior.

about "writing the low halves ..." - no, it is a global state, and only vzero gets you out of it*. It occurs even if you dirty a zmm register but use different ones for ymm and xmm. It occurs even if the only dirtying instruction is a zeroing idiom like vpxord zmm0, zmm0, zmm0. It doesn't occur for writes to zmm16-31 though.

His description of actually extending all vector ops to 512 bits isn't quite right, because he later confirmed that it doesn't reduce throughput for 128 and 256-bit instructions. But we know that when 512-bit uops are in flight, the vector ALUs on port 1 are shut down. (So the 256-bit FMA units normally accessible via ports 0 and 1 can combine into a 512-bit unit for all FP math, integer multiply, and possibly some other stuff. Some SKX Xeons have a 2nd 512-bit FMA unit on port 5, some don't.)
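The Skylake-AVX512 advice above can be sketched as a 512-bit kernel that leaves the "upper 256 dirty" state before returning to xmm/ymm code. This is only an illustration under assumptions: `scale_avx512` and `scale_floats` are hypothetical names, the `target` attribute and `__builtin_cpu_supports` are GCC/Clang extensions, and the compiler may well emit its own vzeroupper at the function exit anyway. It does not measure turbo behaviour; it only shows where the instruction would go.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper (not from the answer): multiply n floats by k
   using 512-bit vectors, which dirties ZMM registers. */
__attribute__((target("avx512f")))
static void scale_avx512(float *p, size_t n, float k)
{
    __m512 vk = _mm512_set1_ps(k);
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(p + i, _mm512_mul_ps(_mm512_loadu_ps(p + i), vk));

    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; i++)
        p[i] *= k;

    /* Clears bits 128 and up of zmm0..15 (zmm16..31 are untouched).
       Per the answer, this or vzeroall is what lets SKX leave the
       dirty-upper state. */
    _mm256_zeroupper();
}

/* Runtime dispatch with a plain scalar fallback for CPUs without AVX-512F. */
void scale_floats(float *p, size_t n, float k)
{
    assert(p != NULL || n == 0);
    if (__builtin_cpu_supports("avx512f")) {
        scale_avx512(p, n, k);
        return;
    }
    for (size_t i = 0; i < n; i++)
        p[i] *= k;
}
```

Placing the vzeroupper once at the end of the 512-bit region, rather than inside the loop, keeps its cost negligible on mainstream Intel cores.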

For max-turbo after using only AVX1 / AVX2 (including on earlier CPUs like Haswell): Opportunistically powering down the upper halves of execution units if they haven't been used for a while (and sometimes allowing higher Turbo clock speeds) depends on whether YMM instructions have been used recently, not on whether the upper halves are dirty or not. So AFAIK, vzeroupper does not help the CPU un-throttle the clock speed sooner after using AVX1 / AVX2, for CPUs where max turbo is lower for 256-bit.

This is different from Intel's Skylake-AVX512 (SKX / Skylake-SP), where AVX512 is somewhat "bolted on".

`VZEROUPPER` might make context switches slightly cheaper, because the CPU still knows whether the ymm-upper state is clean or dirty.

If it's clean, I think xsaveopt or xsavec can write out the FPU state more compactly, without storing the all-zero upper halves at all (just setting a bit that says they're clean). Notice in the state-transition diagram for SSE/AVX that xsave / xrstor is part of the picture.

An extra vzeroupper just for this is only worth considering if your code won't use any 256b instructions for a long time after this, because ideally you won't have any context switches / CPU migrations before the next use of 256-bit vectors.

This may not apply as much on AVX512 CPUs: vzeroupper / vzeroall don't touch ZMM16..31, only ZMM0..15. So you can still have lots of dirty state after vzeroall.

(Plausible in theory): Dirty upper halves may be taking up physical registers (although IDK of any evidence for this being true on any real CPUs). If so, it would limit out-of-order window size for the CPU to find instruction-level parallelism. (ROB size is the other major limiting factor, but PRF size can be the bottleneck.)

This may be true on AMD CPUs before Zen2, where 256b ops are split into two 128b ops. YMM registers are handled internally as two 128-bit registers, and e.g. vmovaps ymm0, ymm1 renames the low 128 with zero latency, but needs a uop for the upper half. (See Agner Fog's microarch pdf). It's unknown whether vzeroupper can actually drop the renaming for the upper halves, though. Zeroing idioms on AMD Zen (unlike SnB-family) still need a back-end uop to write the register value, even for the 128b low half; only mov-elimination avoids a back-end uop. So there may not be a physical zero register that uppers can be renamed onto.

Experiments in a blog post on ROB size / PRF size show that FP physical register file entries are 256-bit in Sandybridge, though, so vzeroupper shouldn't free up more registers on mainstream Intel CPUs with AVX/AVX2. Haswell-style transition penalties are slow enough that the transition probably drains the ROB to save or restore the uppers to separate storage that isn't renamed, rather than using up valuable PRF entries.

Silvermont doesn't support AVX. And it uses a separate retirement register file for the architectural state, so the out-of-order PRF only holds speculative execution results. So even if it did support AVX with 128-bit halves, a stale YMM register with a dirty upper half probably wouldn't be using up extra space in the rename register file.

KNL (Knights Landing / Xeon Phi) is specifically designed to run AVX512, so presumably its FP register file has 512-bit entries. It's based on Silvermont, but the SIMD parts of the core are different (e.g. it can reorder FP/vector instructions, while Silvermont can only execute them speculatively but not reorder them within the FP/vector pipeline, according to Agner Fog). Still, KNL may also use a separate retirement register file, so dirty ZMM uppers wouldn't consume extra space even if it were able to split a 512-bit entry to store two 256-bit vectors. That is unlikely anyway, because a larger out-of-order window for only AVX1/AVX2 on KNL wouldn't be worth spending transistors on.

vzeroupper is much slower on KNL than on mainstream Intel CPUs (one per 36 cycles in 64-bit mode), so you probably wouldn't want to use it there, especially just for the tiny context-switch advantage.

On Skylake-AVX512, the evidence supports the conclusion that the vector physical register file is 512 bits wide.

Some future CPU might pair up entries in a physical register file to store wide vectors, even if they don't normally decode to separate uops the way AMD does for 256-bit vectors.

@Mysticial reports unexpected slowdowns in code with long FP dependency chains when comparing YMM vs. ZMM versions of otherwise identical code, but later experiments disagree with the conclusion that SKX uses 2x 256-bit register file entries for ZMM registers when the upper 256 bits are dirty.
