Yes, some recent versions of GCC (e.g. 4.9, as of March 2015) are able to emit PREFETCH instructions when optimizing with -O3, even without any explicit __builtin_prefetch.
We don't know what get_neighbor is doing, or what the types of v and neigh_val are.
And prefetching is not always profitable. Adding explicit __builtin_prefetch can slow down your code. You need to measure.
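To make "measure" concrete, here is a minimal timing sketch. The helper name best_time_ms is hypothetical (it is not from the question or from any library); it simply runs a callable several times and keeps the fastest wall-clock time, which tends to be less noisy than a single run.

```cpp
#include <chrono>
#include <limits>

// Hypothetical helper (not from the question): runs `work` `repeats` times
// and returns the fastest wall-clock time in milliseconds.
template <typename Work>
double best_time_ms(Work&& work, int repeats) {
    using clock = std::chrono::steady_clock;
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; r++) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (ms < best) best = ms;
    }
    return best;
}
```

You would call it once with the loop that has the explicit prefetch and once with the loop that does not, on realistic data sizes. Make sure the result of the loop is actually used (e.g. printed or stored in a volatile variable), otherwise the optimizer may remove the work you are trying to time.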
As Retired Ninja commented, prefetching in one loop and hoping data would be cached in the following loop (further down in your source code) is wrong.
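You might perhaps try instead:
for (size_t i = 0; i < v.get_num_edges(); i++) {
    fg::vertex_id_t id = v.get_neighbor(i);
    // __builtin_prefetch takes an address, hence the & (assuming neigh_val
    // is an array or pointer); note that i+4 can run past the last edge
    __builtin_prefetch (&neigh_val[v.get_neighbor(i+4)]);
    res += neigh_val[id];
}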
You could empirically replace the 4 with whatever constant turns out to be best.
But I guess that the __builtin_prefetch above is useless (since the compiler is probably able to add it by itself), and it could do harm, or even crash the program when computing its argument has undefined behavior, e.g. if v.get_neighbor(i+4) is undefined. Prefetching an address outside of your address space will not fault by itself, but it could slow down your program. Please benchmark.
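If you do want to experiment, one way to avoid the undefined look-ahead is to guard it. The following is only a sketch under assumptions: since we don't know the real types, `neighbors` is a hypothetical stand-in for v.get_neighbor(i) and `values` for neigh_val, both as plain std::vector, and the function name is made up. It relies on the GCC/Clang builtin __builtin_prefetch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins: `neighbors[i]` plays the role of v.get_neighbor(i)
// and `values` the role of neigh_val. Neither type comes from the question.
// Every entry of `neighbors` is assumed to be a valid index into `values`.
double sum_neighbors_prefetch(const std::vector<std::uint32_t>& neighbors,
                              const std::vector<double>& values,
                              std::size_t distance) {
    double res = 0.0;
    const std::size_t n = neighbors.size();
    for (std::size_t i = 0; i < n; i++) {
        // Only compute the look-ahead index when it actually exists, so the
        // argument of __builtin_prefetch never has undefined behavior.
        if (i + distance < n)
            __builtin_prefetch(&values[neighbors[i + distance]]);
        res += values[neighbors[i]];
    }
    return res;
}
```

The extra branch is not free, so this variant has to be benchmarked as well, with several values of `distance`, against the plain loop without any prefetch.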
See this answer to a related question.
Notice that in C++ both [] and get_neighbor could be overloaded and become very complex operations, so we cannot guess!
And there are cases where the hardware is limiting performance, whatever __builtin_prefetch you add (and adding them could hurt performance).
BTW, you might pass -O3 -mtune=native -fdump-tree-ssa -S -fverbose-asm to GCC to understand better what the compiler is doing (and look inside the generated dump files and assembler files); also, it does happen that -O3 produces slightly slower code than -O2.
You could consider explicit multithreading, OpenMP, or OpenCL if you have time to waste on optimization. Remember that premature optimization is evil. Did you benchmark? Did you profile your entire application?