There are two big inefficiencies in your loop that are immediately apparent:
(1) these two chunks of scalar code:
__declspec(align(32)) double ar[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
__m256d y = _mm256_load_pd(ar);
__declspec(align(32)) double arr[4] = { xb[i].x, xb[i + 1].x, xb[i + 2].x, xb[i + 3].x };
__m256d w = _mm256_load_pd(arr);
should be implemented using SIMD loads and shuffles (or at the very least use _mm256_set_pd
and give the compiler a chance to do a half-reasonable job of generating code for a gathered load).
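As a rough sketch of point (1): the type of xb is not shown, so the Vec2 struct below (two adjacent doubles) is an assumption, as are the helper names load_xy and load_x_set and the AVX_FN macro. Under that assumption, two unaligned 256-bit loads followed by a lane permute and an unpack de-interleave four elements without going through a scalar temporary array. The second helper shows the _mm256_set_pd fallback. The stores to xs and ys are only there so the result can be inspected; in a real loop the vectors would stay in registers.

```c
#include <immintrin.h>

/* Lets GCC/Clang compile these functions without -mavx; MSVC needs nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical element layout: the real type of xb is not shown. */
typedef struct { double x, y; } Vec2;

/* De-interleave four consecutive Vec2 into xs[0..3] and ys[0..3]. */
AVX_FN void load_xy(const Vec2 *p, double *xs, double *ys)
{
    __m256d a  = _mm256_loadu_pd((const double *)&p[0]);  /* x0 y0 x1 y1 */
    __m256d b  = _mm256_loadu_pd((const double *)&p[2]);  /* x2 y2 x3 y3 */
    __m256d t0 = _mm256_permute2f128_pd(a, b, 0x20);      /* x0 y0 x2 y2 */
    __m256d t1 = _mm256_permute2f128_pd(a, b, 0x31);      /* x1 y1 x3 y3 */
    _mm256_storeu_pd(xs, _mm256_unpacklo_pd(t0, t1));     /* x0 x1 x2 x3 */
    _mm256_storeu_pd(ys, _mm256_unpackhi_pd(t0, t1));     /* y0 y1 y2 y3 */
}

/* Fallback: let the compiler build the gathered load itself.
   _mm256_set_pd takes its arguments from the highest lane to the lowest. */
AVX_FN void load_x_set(const Vec2 *p, double *xs)
{
    _mm256_storeu_pd(xs, _mm256_set_pd(p[3].x, p[2].x, p[1].x, p[0].x));
}
```

If the struct has more than two members, or padding between them, the shuffle sequence would need to change, but the _mm256_set_pd version still applies as written.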
(2) the horizontal summation at the end of the loop:
for (int i = 0; i < n; i++)
{
    // ... load x, y, z, w ...
    __m256d xy = _mm256_mul_pd(x, y);
    __m256d zw = _mm256_mul_pd(z, w);
    __m256d temp = _mm256_hadd_pd(xy, zw);
    __m128d hi128 = _mm256_extractf128_pd(temp, 1);
    __m128d low128 = _mm256_extractf128_pd(temp, 0);
    //__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
    __m128d dotproduct = _mm_add_pd(low128, hi128);
    sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
    i += 3;
}
should be moved out of the loop:
__m256d xy = _mm256_setzero_pd();
__m256d zw = _mm256_setzero_pd();
for (int i = 0; i < n; i++)
{
    // ... load x, y, z, w ...
    xy = _mm256_add_pd(xy, _mm256_mul_pd(x, y));
    zw = _mm256_add_pd(zw, _mm256_mul_pd(z, w));
    i += 3;
}
// horizontal sum, done once after the loop
__m256d temp = _mm256_hadd_pd(xy, zw);
__m128d hi128 = _mm256_extractf128_pd(temp, 1);
__m128d low128 = _mm256_extractf128_pd(temp, 0);
//__m128d dotproduct = _mm_add_pd((__m128d)temp, hi128);
__m128d dotproduct = _mm_add_pd(low128, hi128);
sum += dotproduct.m128d_f64[0]+dotproduct.m128d_f64[1];
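For reference, here is a minimal self-contained sketch of point (2) with the accumulation inside the loop and a single horizontal sum afterwards. The function name dot_sum, the AVX_FN macro, and the four plain double arrays are assumptions made to keep the example short (your data lives in structs), and n is assumed to be a multiple of 4. It also replaces the MSVC-only .m128d_f64 member access with _mm_cvtsd_f64 and _mm_unpackhi_pd, which should work on other compilers too.

```c
#include <immintrin.h>

/* Lets GCC/Clang compile this function without -mavx; MSVC needs nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

/* Hypothetical helper: returns the sum over i of x[i]*y[i] + z[i]*w[i].
   Assumes n is a multiple of 4; a scalar tail loop would handle the rest. */
AVX_FN double dot_sum(const double *x, const double *y,
                      const double *z, const double *w, int n)
{
    __m256d xy = _mm256_setzero_pd();
    __m256d zw = _mm256_setzero_pd();

    for (int i = 0; i < n; i += 4) {
        /* Vertical accumulation only: no horizontal work inside the loop. */
        xy = _mm256_add_pd(xy, _mm256_mul_pd(_mm256_loadu_pd(x + i),
                                             _mm256_loadu_pd(y + i)));
        zw = _mm256_add_pd(zw, _mm256_mul_pd(_mm256_loadu_pd(z + i),
                                             _mm256_loadu_pd(w + i)));
    }

    /* One horizontal reduction after the loop. */
    __m256d temp   = _mm256_hadd_pd(xy, zw);          /* xy0+xy1 zw0+zw1 xy2+xy3 zw2+zw3 */
    __m128d hi128  = _mm256_extractf128_pd(temp, 1);  /* upper 128 bits */
    __m128d low128 = _mm256_castpd256_pd128(temp);    /* lower 128 bits, no instruction needed */
    __m128d dotproduct = _mm_add_pd(low128, hi128);   /* sum(xy) sum(zw) */

    /* Add the two remaining lanes without compiler-specific member access. */
    return _mm_cvtsd_f64(dotproduct)
         + _mm_cvtsd_f64(_mm_unpackhi_pd(dotproduct, dotproduct));
}
```

Note that the summation order differs from the original loop, so results can differ in the last few bits for general floating-point input.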