Use `movzx` to load narrow data on modern CPUs. (Or `movsx` if it's useful to have it sign-extended instead of zero-extended, but `movzx` is sometimes faster and never slower.)
`movzx` is only slow on the ancient P5 (original Pentium) microarchitecture, not anything made this century. Pentium-branded CPUs based on recent microarchitectures, like Pentium G3258 (Haswell, 20th anniversary edition of the original Pentium), are totally different beasts, and perform like the equivalent i3 but without AVX, BMI1/2, or hyperthreading.
Don't tune modern code based on P5 guidelines / numbers. However, Knight's Corner (Xeon Phi) is based on a modified P54C microarchitecture, so perhaps it has slow `movzx` as well. Neither Agner Fog nor Instlatx64 has per-instruction throughput / latency numbers for KNC.
Using an instruction with a 16-bit operand size doesn't switch the whole pipeline over to 16-bit mode or cause a big perf hit. See Agner Fog's microarch pdf to learn exactly what is and isn't slow on various x86 CPU microarchitectures (including ones as old as Intel P5 (original Pentium), which you seem to be talking about for some reason).
Writing a 16-bit register and then reading the full 32/64-bit register is slow on some CPUs (partial-register stall when merging on Intel P6-family). On others, writing a 16-bit register merges into the old value, so there's a false dependency on the old value of the full register every time you write, even if you never read the full register. See Agner Fog's microarch guide for which CPU does what. (Note that Haswell/Skylake only rename AH separately; Sandybridge (like Core2/Nehalem) also renames AL / AX separately from RAX, but unlike them it merges without stalling.)
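For example, a minimal sketch of the pattern these penalties punish (`src1` is a placeholder label, as in the examples below):

    mov  ax, word [src1]   ; writes AX only; the rest of EAX keeps its old value
    add  eax, ecx          ; reads full EAX: partial-register stall on P6-family
                           ; (on CPUs that don't rename AX, the 16-bit write
                           ;  itself carried a false dep on the old EAX)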
Unless you specifically care about in-order P5 (or possibly Knight's Corner Xeon Phi, based on the same core, but IDK if `movzx` is slow there, too), USE THIS:
    movzx  eax, word [src1]   ; as efficient as a 32-bit MOV load on most CPUs
    cmp    ax, word [src2]
The operand-size prefix for `cmp` decodes efficiently on all modern CPUs. Reading a 16-bit register after writing the full register is always fine, and the 16-bit load for the other operand is also fine.
The operand-size prefix isn't length-changing because there's no imm16 / imm32. e.g. `cmp word [src2], 0x7F` is fine (it can use a sign-extended imm8), but `cmp word [src2], 0x80` needs an imm16 and will LCP-stall on some Intel CPUs. (Without the operand-size prefix, the same opcode would have an imm32, i.e. the rest of the instruction would be a different length.) Instead, use `mov eax, 0x80` / `cmp word [src2], ax`.
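Spelled out as a sketch (same placeholder labels as above):

    cmp   word [src2], 0x7F   ; fits in a sign-extended imm8: no LCP stall
    cmp   word [src2], 0x80   ; needs an imm16: LCP stall on some Intel CPUs
    mov   eax, 0x80           ; workaround: put the constant in a register...
    cmp   word [src2], ax     ; ...so there's no imm16 after the prefix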
The address-size prefix can be length-changing in 32-bit mode (disp32 vs. disp16), but we don't want to use 16-bit addressing modes to access 16-bit data. We're still using `[ebx+1234]` (or `rbx`), not `[bx+1234]`.
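For example, a sketch of indexing a word array with a normal 32-bit addressing mode (registers and displacement are illustrative):

    movzx  eax, word [ebx + esi*2 + 1234]   ; scale the index by 2 for word elements;
                                            ; disp32, so no address-size prefix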
On modern x86 (Intel P6 / SnB-family / Atom / Silvermont, AMD since at least K7, i.e. anything made in this century, newer than actual P5 Pentium), `movzx` loads are very efficient.
On many CPUs, the load ports directly support `movzx` (and sometimes also `movsx`), so it runs as just a load uop, not as a load + ALU.
Data from Agner Fog's instruction-set tables: note that they may not cover every corner case, e.g. `mov`-load numbers might only be for 32 / 64-bit loads. Also note that Agner Fog's load latency numbers are not load-use latency from L1d cache; they only make sense as part of the store/reload (store-forwarding) latency, but the relative numbers will tell us how many cycles `movzx` adds on top of `mov` (often no extra cycles).
(Update: https://uops.info/ has better test results that actually reflect load-use latency, and they're automated so typos and clerical errors in updating the spreadsheets aren't a problem. But uops.info only goes back to Conroe (first-gen Core2) for Intel, and only back to Zen for AMD.)
P5 Pentium (in-order execution): `movzx`-load is a 3-cycle instruction (plus a decode bottleneck from the `0F` prefix), vs. `mov`-loads being single-cycle throughput. (They still have latency, though.)
Intel:
- PPro / Pentium II / III: `movzx`/`movsx` run on just a load port, same throughput as plain `mov`.
- Core2 / Nehalem: same, including 64-bit `movsxd`, except on Core2 where a `movsxd r64, m32` load costs a load + ALU uop, which don't micro-fuse.
- Sandybridge-family (SnB through Skylake and later): `movzx`/`movsx` loads are single-uop (just a load port), and perform identically to `mov` loads.
- Pentium 4 (NetBurst): `movzx` runs on the load port only, same perf as `mov`. `movsx` is load + ALU, and takes 1 extra cycle.
- Atom (in-order): Agner's table is unclear about whether memory-source `movzx`/`movsx` need an ALU, but they're definitely fast. The latency number is only for reg,reg.
- Silvermont: same as Atom: fast, but unclear on needing a port.
- KNL (based on Silvermont): Agner lists `movzx`/`movsx` with a memory source as using IP0 (ALU), but the latency is the same as `mov r,m`, so there's no penalty. (Execution-unit pressure is not a problem because KNL's decoders can barely keep its 2 ALUs fed anyway.)
AMD:
- Bobcat: `movzx`/`movsx` loads are 1 per clock, 5-cycle latency. A `mov`-load is 4c latency.
- Jaguar: `movzx`/`movsx` loads are 1 per clock, 4-cycle latency. `mov` loads are 1 per clock, 3c latency for 32/64-bit, or 4c for `mov r8/r16, m` (but still only an AGU port, not an ALU merge like Haswell/Skylake do).
- K7/K8/K10: `movzx`/`movsx` loads have 2-per-clock throughput, latency 1 cycle higher than a `mov` load. They use an AGU and an ALU.
- Bulldozer-family: same as K10, but a `movsx`-load has 5-cycle latency. A `movzx`-load has 4-cycle latency, a `mov`-load has 3-cycle latency. So in theory it might be lower latency to `mov cx, word [mem]` and then `movsx eax, cx` (1 cycle), if the false dependency from a 16-bit `mov` load doesn't require an extra ALU merge, or create a loop-carried dependency for your loop. (There's a sketch of this after the per-CPU notes.)
- Ryzen: `movzx`/`movsx` loads run in the load port only, same latency as `mov` loads.
VIA:
- Via Nano 2000/3000: `movzx` runs on the load port only, same latency as `mov` loads. `movsx` is LD + ALU, with 1c extra latency.
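As promised, a sketch of that speculative Bulldozer-family alternative (`mem` is a placeholder; whether it actually wins depends on the false-dependency question above):

    mov    cx, word [mem]   ; 3c mov-load, but writes only CX: false dep on old ECX
    movsx  eax, cx          ; 1c ALU: 4c total, vs. 5c for  movsx eax, word [mem]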
When I say "perform identically", I mean not counting any partial-register penalties or cache-line splits from a wider load. e.g. a `movzx eax, word [rsi]` avoids a merging penalty vs. `mov ax, word [rsi]` on Skylake, but I'll still say that `mov` performs identically to `movzx`. (I guess I mean that `mov eax, dword [rsi]` without any cache-line splits is as fast as `movzx eax, word [rsi]`.)
`xor`-zeroing the full register before writing a 16-bit register avoids a later partial-register merging stall on Intel P6-family, as well as breaking false dependencies.
If you want to run well on P5 as well, this might be somewhat better there while not being much worse on any modern CPU, except on PPro through PIII, where `xor`-zeroing isn't dep-breaking even though it is still recognized as a zeroing idiom that makes EAX equivalent to AX (no partial-register stall when reading EAX after writing AL or AX).
    ;; Probably not a good idea; maybe not faster on anything.
    ;mov  eax, 0    ; some code tuned for PIII used *both* this and xor-zeroing.
    xor  eax, eax   ; *not* dep-breaking on early P6 (up to PIII)
    mov  ax, word [src1]
    cmp  ax, word [src2]
    ; safe to read EAX without partial-reg stalls
The operand-size prefix isn't ideal for P5, so you could consider using a 32-bit load if you're sure it doesn't fault, cross a cache-line boundary, or cause a store-forwarding failure from a recent 16-bit store.
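A sketch of that idea (only safe under the conditions just listed, because it reads 2 bytes past the u16 at `src1`):

    mov  eax, dword [src1]   ; plain 32-bit load: no prefix, no 0F escape byte
    cmp  ax, word [src2]     ; compare only the low 16 bits (reading AX after
                             ; writing EAX is always fine)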
Actually, I think a 16-bit `mov` load might be slower on Pentium than the `movzx`/`cmp` 2-instruction sequence. There really doesn't seem to be a good option for working with 16-bit data as efficiently as with 32-bit! (Other than packed MMX stuff, of course.)
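For instance, if you have several contiguous u16 elements to compare, a minimal MMX sketch (PMMX or later; assumes at least four elements at each placeholder label):

    movq     mm0, [src1]   ; load four packed 16-bit elements
    pcmpeqw  mm0, [src2]   ; per-element compare: 0xFFFF where equal, 0 where not
    ; (run emms before any following x87 code)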
See Agner Fog's guide for the Pentium details, but the operand-size prefix takes an extra 2 cycles to decode on P1 (original P5) and PMMX, so this sequence may actually be worse than a `movzx` load. On P1 (but not PMMX), the `0F` escape byte (used by `movzx`) also counts as a prefix, taking an extra cycle to decode.
Apparently `movzx` isn't pairable anyway. Multi-cycle `movzx` will hide the decode latency of `cmp ax, [src2]`, so `movzx` / `cmp` is probably still the best choice. Or schedule instructions so the `movzx` is done earlier and the `cmp` can maybe pair with something (as sketched below). Anyway, the scheduling rules are quite complicated for P1/PMMX.
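i.e. something like this sketch, where the placeholder comment stands for whatever independent work the surrounding code has:

    movzx  eax, word [src1]   ; multi-cycle on P5, and not pairable: start it early
    ; ... independent instruction(s) here overlap with it ...
    cmp    ax, word [src2]    ; now free to maybe pair with a neighbouring instruction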
I timed this loop on Core2 (Conroe) to prove that xor-zeroing avoids partial-register stalls for 16-bit registers as well as low-8 (like for `setcc al`):
    mov     ebp, 100000000
    ALIGN 32
    .loop:
    %rep 4
        xor     eax, eax
        ; mov   eax, 1234   ; just break the dep on the old value, not a zeroing idiom
        mov     ax, cx      ; write AX
        mov     edx, eax    ; read EAX
    %endrep
        dec     ebp         ; Core2 can't fuse dec / jcc even in 32-bit mode
        jg      .loop       ; but SnB does
`perf stat -r4 ./testloop` output for this in a static binary that makes a `sys_exit` system call after the loop:
    ;; Core2 (Conroe) with XOR eax, eax
      469,277,071   cycles         # 2.396 GHz
    1,400,878,601   instructions   # 2.98  insns per cycle
      100,156,594   branches