Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
831 views
in Technique[技术] by (71.8m points)

assembly - Assembler task : translations

Here given fragments of the source code and their translation

Why.L21 (which is longer than .L27) is faster than .L27?

Why flag -funroll-loops speeds up loop1 but doesn't speed up loop2?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Anyway, about your question... the first loop doesn't have dependency inside the loop, i.e. each iteration of loop is independent and can be calculated ASAP (actually all except the last iteration can be just thrown away, because they don't affect return value at all).

The second loop for each iteration depends on previous result, so the CPU has to wait with each next imul until the previous result is ready. The imul on modern x86 has still throughput about 1.0 I guess, but the latency may be above 1.0 and not sure what the dependency will do, depends completely on your target CPU platform, which you didn't specify. (somebody like Peter Cordes can surely answer this for particular modern Intel micro architectures, or you can read yourself Agner's tables, but as you didn't specify target architecture, I don't see point in making any particular real world example, for me this general chit-chat level is enough)

For example on 80386 I guess the second loop would be faster, because it has less instructions, and 80386 was still quite "simple" inside, with imul taking several clocks in either case. On latest Intel CPUs the dependency will probably just so-so skew it in favour of first one, but not much, as imul is reasonably fast today.

Anyway, this is nice example how sorting out your algorithm first, and tuning that, will give you the biggest performance gain, as the first loop is not a real loop, and writing it as simple formula will make the code even faster.

Curiously enough, I tried in godbolt explorer, what modern compilers do about it, and gcc does some quite convoluted thing to read through each array member, or what exactly does that wall of instructions do (too lazy to check in debugger), while clang compiler does see through it and produces the simplified formula instead: https://godbolt.org/g/p2MGHs

P.S. the first loop can be simplified down to:

int loop1_fix(int *a, int x, int n) {
    if (0 < n) return a[n-1]*x*x*x*x;
    else return x*x*x;
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...