c++ - Why does tree vectorization make this sorting algorithm 2x slower?

Question

Welcome To Ask or Share your Answers For Others

c++ - Why does tree vectorization make this sorting algorithm 2x slower?

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - Why does tree vectorization make this sorting algorithm 2x slower?

The sorting algorithm of this question becomes twice faster(!) if -fprofile-arcs is enabled in gcc (4.7.2). The heavily simplified C code of that question (it turned out that I can initialize the array with all zeros, the weird performance behavior remains but it makes the reasoning much much simpler):

#include <time.h>
#include <stdio.h>

#define ELEMENTS 100000

int main() {
  int a[ELEMENTS] = { 0 };
  clock_t start = clock();
  for (int i = 0; i < ELEMENTS; ++i) {
    int lowerElementIndex = i;
    for (int j = i+1; j < ELEMENTS; ++j) {
      if (a[j] < a[lowerElementIndex]) {
        lowerElementIndex = j;
      }
    }
    int tmp = a[i];
    a[i] = a[lowerElementIndex];
    a[lowerElementIndex] = tmp;
  } 
  clock_t end = clock();
  float timeExec = (float)(end - start) / CLOCKS_PER_SEC;
  printf("Time: %2.3f
", timeExec);
  printf("ignore this line %d
", a[ELEMENTS-1]);
}

After playing with the optimization flags for a long while, it turned out that -ftree-vectorize also yields this weird behavior so we can take -fprofile-arcs out of the question. After profiling with perf I have found that the only relevant difference is:

Fast case gcc -std=c99 -O2 simp.c (runs in 3.1s)

    cmpl    %esi, %ecx
    jge .L3
    movl    %ecx, %esi
    movslq  %edx, %rdi
.L3:

Slow case gcc -std=c99 -O2 -ftree-vectorize simp.c (runs in 6.1s)

    cmpl    %ecx, %esi
    cmovl   %edx, %edi
    cmovl   %esi, %ecx

As for the first snippet: Given that the array only contains zeros, we always jump to .L3. It can greatly benefit from branch prediction.

I guess the cmovl instructions cannot benefit from branch prediction.

Questions:

Are all my above guesses correct? Does this make the algorithm slow?
If yes, how can I prevent gcc from emitting this instruction (other than the trivial -fno-tree-vectorization workaround of course) but still doing as much optimizations as possible?
What is this -ftree-vectorization? The documentation is quite vague, I would need a little more explanation to understand what's happening.

Update: Since it came up in comments: The weird performance behavior w.r.t. the -ftree-vectorize flag remains with random data. As Yakk points out, for selection sort, it is actually hard to create a dataset that would result in a lot of branch mispredictions.

Since it also came up: I have a Core i5 CPU.

Based on Yakk's comment, I created a test. The code below (online without boost) is of course no longer a sorting algorithm; I only took out the inner loop. Its only goal is to examine the effect of branch prediction: We skip the if branch in the for loop with probability p.

#include <algorithm>
#include <cstdio>
#include <random>
#include <boost/chrono.hpp>
using namespace std;
using namespace boost::chrono;
constexpr int ELEMENTS=1e+8; 
constexpr double p = 0.50;

int main() {
  printf("p = %.2f
", p);
  int* a = new int[ELEMENTS];
  mt19937 mt(1759);
  bernoulli_distribution rnd(p);
  for (int i = 0 ; i < ELEMENTS; ++i){
    a[i] = rnd(mt)? i : -i;
  }
  auto start = high_resolution_clock::now();
  int lowerElementIndex = 0;
  for (int i=0; i<ELEMENTS; ++i) {
    if (a[i] < a[lowerElementIndex]) {
      lowerElementIndex = i;
    }
  } 
  auto finish = high_resolution_clock::now();
  printf("%ld  ms
", duration_cast<milliseconds>(finish-start).count());
  printf("Ignore this line   %d
", a[lowerElementIndex]);
  delete[] a;
}

The loops of interest:

This will be referred to as cmov

g++ -std=c++11 -O2 -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L30:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    movslq  %eax, %rdx
    cmovl   %rdx, %rbp
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L30

This will be referred to as no cmov, the -fno-if-conversion flag was pointed out by Turix in his answer.

g++ -std=c++11 -O2 -fno-if-conversion -lboost_chrono -lboost_system -lrt branch3.cpp

    xorl    %eax, %eax
.L29:
    movl    (%rbx,%rbp,4), %edx
    cmpl    %edx, (%rbx,%rax,4)
    jge .L28
    movslq  %eax, %rbp
.L28:
    addq    $1, %rax
    cmpq    $100000000, %rax
    jne .L29

The difference side by side

cmpl    %edx, (%rbx,%rax,4) |     cmpl  %edx, (%rbx,%rax,4)
movslq  %eax, %rdx          |     jge   .L28
cmovl   %rdx, %rbp          |     movslq    %eax, %rbp
                            | .L28:

The execution time as a function of the Bernoulli parameter p

effect of branch prediction

The code with the cmov instruction is absolutely insensitive to p. The code without the cmov instruction is the winner if p<0.26 or 0.81<p and is at most 4.38x faster (p=1). Of course, the worse situation for the branch predictor is at around p=0.5 where the code is 1.58x slower than the code with the cmov instruction.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:19:15+0000

Note: Answered before graph update was added to the question; some assembly code references here may be obsolete.

(Adapted and extended from our above chat, which was stimulating enough to cause me to do a bit more research.)

First (as per our above chat), it appears that the answer to your first question is "yes". In the vector "optimized" code, the optimization (negatively) affecting performance is branch predication, whereas in the original code the performance is (positively) affected by branch prediction. (Note the extra 'a' in the former.)

Re your 3rd question: Even though in your case, there is actually no vectorization being done, from step 11 ("Conditional Execution") here it appears that one of the steps associated with vectorization optimizations is to "flatten" conditionals within targeted loops, like this bit in your loop:

if (a[j] < a[lowerElementIndex]
    lowerElementIndex = j;

Apparently, this happens even if there is no vectorization.

This explains why the compiler is using the conditional move instructions (cmovl). The goal there is to avoid a branch entirely (as opposed to trying to predict it correctly). Instead, the two cmovl instructions will be sent down the pipeline before the result of the previous cmpl is known and the comparison result will then be "forwarded" to enable/prevent the moves prior to their writeback (i.e., prior to them actually taking effect).

Note that if the loop had been vectorized, this might have been worth it to get to the point where multiple iterations through the loop could effectively be accomplished in parallel.

However, in your case, the attempt at optimization actually backfires because in the flattened loop, the two conditional moves are sent through the pipeline every single time through the loop. This in itself might not be so bad either, except that there is a RAW data hazard that causes the second move (cmovl %esi, %ecx) to have to wait until the array/memory access (movl (%rsp,%rsi,4), %esi) is completed, even if the result is going to be ultimately ignored. Hence the huge time spent on that particular cmovl. (I would expect this is an issue with your processor not having complex enough logic built into its predication/forwarding implementation to deal with the hazard.)

On the other hand, in the non-optimized case, as you rightly figured out, branch prediction can help to avoid having to wait on the result of the corresponding array/memory access there (the movl (%rsp,%rcx,4), %ecx instruction). In that case, when the processor correctly predicts a taken branch (which for an all-0 array will be every single time, but [even] in a random array should [still] be ~~roughly~~more than [edited per @Yakk's comment] half the time), it does not have to wait for the memory access to finish to go ahead and queue up the next few instructions in the loop. So in correct predictions, you get a boost, whereas in incorrect predictions, the result is no worse than in the "optimized" case and, furthermore, better because of the ability to sometimes avoid having the 2 "wasted" cmovl instructions in the pipeline.

[The following was removed due to my mistaken assumption about your processor per your comment.]
Back to your questions, I would suggest looking at that link above for more on the flags relevant to vectorization, but in the end, I'm pretty sure that it's fine to ignore that optimization given that your Celeron isn't capable of using it (in this context) anyway.

[Added after above was removed]
Re your second question ("...how can I prevent gcc from emitting this instruction..."), you could try the -fno-if-conversion and -fno-if-conversion2 flags (not sure if these always work -- they no longer work on my mac), although I do not think your problem is with the cmovl instruction in general (i.e., I wouldn't always use those flags), just with its use in this particular context (where branch prediction is going to be very helpful given @Yakk's point about your sort algorithm).

Categories

c++ - Why does tree vectorization make this sorting algorithm 2x slower?

c++ - Why does tree vectorization make this sorting algorithm 2x slower?

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags