update: I see you're using GNU C's native vector syntax, not Intel intrinsics. Are you avoiding Intel intrinsics for portability to non-x86? gcc currently does a bad job compiling code that uses GNU C vectors wider than the target machine supports. (You'd hope that it would just use two 128b vectors and operate on each separately, but apparently it's worse than that.)
Anyway, this answer shows how you can use Intel x86 intrinsics to load data into GNU C vector-syntax types.
First of all, looking at compiler output at less than `-O2` is a waste of time if you're trying to learn anything about what will compile to good code. Your `main()` will optimize to just a `ret` at `-O2`.
Besides that, it's not totally surprising that you get bad asm from assigning elements of a vector one at a time.
Aside: normal people would call the type `v4df` (vector of 4 Double Float) or something, not `vector`, so they don't go insane when using it with C++ `std::vector`. For single-precision, `v8sf`. IIRC, gcc uses type names like this internally for `__m256d`.
On x86, Intel intrinsic types (like `__m256d`) are implemented on top of GNU C vector syntax (which is why you can do `v1 * v2` in GNU C instead of writing `_mm256_mul_pd(v1, v2)`). You can convert freely from `__m256d` to `v4df`, like I've done here.
I've wrapped both sane ways to do this in functions, so we can look at their asm. Notice how we're not loading from an array that we define inside the same function, so the compiler won't optimize it away.
I put them on the Godbolt compiler explorer so you can look at the asm with various compile options and compiler versions.
```c
typedef double v4df __attribute__((vector_size(4 * sizeof(double))));
#include <immintrin.h>

// note the return types.  gcc6.1 compiles with no warnings, even at -Wall -Wextra
v4df load_4_doubles_intel(const double *p) { return _mm256_loadu_pd(p); }
```

```asm
    vmovupd ymm0, YMMWORD PTR [rdi]   # tmp89,* p
    ret
```

```c
v4df avx_constant() { return _mm256_setr_pd( 1.0, 2.0, 3.0, 4.0 ); }
```

```asm
    vmovapd ymm0, YMMWORD PTR .LC0[rip]
    ret
```
If the args to `_mm_set*` intrinsics aren't compile-time constants, the compiler will do the best it can to make efficient code to get all the elements into a single vector. It's usually best to do that rather than writing C that stores to a tmp array and loads from it, because that's not always the best strategy. (Store-forwarding failure on multiple narrow stores forwarding to a wide load costs an extra ~10 cycles (IIRC) of latency on top of the usual store-forwarding delay. If your `double`s are already in registers, it's usually best to just shuffle them together.)
See also Is it possible to cast floats directly to __m128 if they are 16 byte alligned? for a list of the various intrinsics for getting a single scalar into a vector. The x86 tag wiki has links to Intel's manuals, and their intrinsics finder.
**Load/store GNU C vectors without Intel intrinsics:**
I'm not sure how you're "supposed" to do that. This Q&A suggests casting a pointer to the memory you want to load, and using a vector type like `typedef char __attribute__ ((vector_size (16),aligned (1))) unaligned_byte16;` (note the `aligned(1)` attribute).
You get a segfault from `*(v4df *)a` because presumably `a` isn't aligned on a 32-byte boundary, but you're using a vector type that does assume natural alignment. (Just like `__m256d` if you dereference a pointer to it instead of using load/store intrinsics to communicate alignment info to the compiler.)