There's no difference, it's just silly redundant naming. Use `_mm512_load_si512` for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX512, and then you can see what the clumsy intrinsic naming is trying to say. Or at least then you can understand how we ended up with this mess of different documentation suggesting `_mm512_load_epi32` vs. `_mm512_load_si512`.
Almost all AVX512 instructions support merge-masking and zero-masking (e.g. `vmovdqa32` can do a masked load like `vmovdqa32 zmm0{k1}{z}, [rdi]` to zero vector elements where `k1` had a zero bit), which is why different element-size versions of things like vector loads and bitwise operations exist (e.g. `vpxord` vs. `vpxorq`).
But these intrinsics are for the no-masking version. The element size is totally irrelevant. I'm guessing `_mm512_load_epi32` exists for consistency with `_mm512_mask_load_epi32` (merge-masking) and `_mm512_maskz_load_epi32` (zero-masking). See the docs for the `vmovdqa32` asm instruction.
e.g. `_mm512_maskz_loadu_epi64(0x55, x)` zeros the odd elements for free while loading. (At least it's free if the cost of putting `0x55` into a `k` register can be hoisted out of a loop, and if we haven't defeated the compiler's chance to fold the load into a memory operand for an ALU instruction.)
When elements are all loaded into the destination unchanged, element boundaries are meaningless. That's why AVX2 and earlier don't have different element-size versions of bitwise booleans like `_mm_xor_si128` and loads/stores like `_mm_load_si128`.
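For comparison, a tiny SSE2 sketch (function name mine, pointers assumed 16-byte aligned): one load intrinsic and one XOR intrinsic serve any element width, because without masking it's just 128 bits in, 128 bits out.

```c
#include <emmintrin.h>
#include <stdint.h>

// XOR 16 bytes; the same intrinsics would be used whether the data is
// conceptually int8, int32, or int64 elements.
void xor_block(uint64_t *dst, const uint64_t *a, const uint64_t *b) {
    __m128i va = _mm_load_si128((const __m128i *)a);
    __m128i vb = _mm_load_si128((const __m128i *)b);
    _mm_store_si128((__m128i *)dst, _mm_xor_si128(va, vb));
}
```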
Some compilers don't support the element-width names for unaligned, unmasked loads. e.g. current gcc doesn't support `_mm512_loadu_epi64`, even though it has supported `_mm512_load_epi64` since the first gcc version to support AVX512 intrinsics at all. (See `error: '_mm512_loadu_epi64' was not declared in this scope`.)
There are no CPUs where the choice of `vmovdqa64` vs. `vmovdqa32` matters at all for efficiency, so there's zero point in trying to hint the compiler to use one or the other, regardless of the natural element width of your data.
Only FP vs. integer might matter for loads, and Intel's intrinsics already use different types (`__m512` vs. `__m512i`) for that.