Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c - Is there a benefit (other than convenience) to ARM NEON array types?

ARM NEON C intrinsics define array types (per the docs), which are C structures containing an array of vector data types. For example, int16x4x2_t holds two int16x4_t vectors.

My question is quite simple: why do these exist? Do they confer some performance advantage over working directly with the vector types? For example, if I vld1q_f64 a float64x2_t from a double pointer and use vaddq_f64, vmulq_f64, vfmaq_f64, etc., would I expect any performance difference compared to vld1q_f64_x4-ing a float64x2x4_t and using the same functions on elements of the array? I guess I'm simply puzzled why they created these fixed-length array types when the only intrinsics that operate on them directly are loads and stores.



1 Answer


They aren't just used for loads and stores; they are used whenever an operation produces more than one output vector. See, for example, vuzp_* / vuzpq_*, which return a pair of 64-bit or 128-bit vectors respectively.
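As a rough sketch of the vuzp case (split_even_odd is a made-up helper name, and the scalar branch is only there so the snippet also builds on a non-ARM machine), a single call hands back both result vectors by value:

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* split_even_odd is a hypothetical helper, not part of arm_neon.h.
 * It separates 16 bytes into their even-indexed and odd-indexed elements. */
void split_even_odd(const uint8_t in[16], uint8_t even[8], uint8_t odd[8])
{
    assert(in && even && odd);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8_t lo = vld1_u8(in);      /* in[0..7]  */
    uint8x8_t hi = vld1_u8(in + 8);  /* in[8..15] */
    /* One call, two result vectors, returned together as a uint8x8x2_t. */
    uint8x8x2_t r = vuzp_u8(lo, hi);
    vst1_u8(even, r.val[0]);         /* in[0], in[2], ..., in[14] */
    vst1_u8(odd, r.val[1]);          /* in[1], in[3], ..., in[15] */
#else
    /* Scalar fallback with the same result, for non-NEON builds. */
    for (int i = 0; i < 8; i++) {
        even[i] = in[2 * i];
        odd[i] = in[2 * i + 1];
    }
#endif
}
```

Without the array type, the intrinsic would need two out-pointers or two separate calls to deliver the same pair of vectors.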

It's a nice way of wrapping up operations which require multiple vectors without resorting to pointers, which are easy to mess up. This includes loads and stores which load/store multiple vectors in a single operation. For example, it would probably have been possible to implement the vld4_s8 function as something like:

void vld4_s8(int8x8_t* a, int8x8_t* b, int8x8_t* c, int8x8_t* d, int8_t* data);

If I saw that signature I would have to find some documentation to figure out WTF was going on with the first four parameters. Are they inputs or outputs? Are they pointers to arrays of vectors, or just a single vector? Do they all need to be set, or can/should I pass NULL if I don't need the value?

The current function, on the other hand, is hard to get wrong. There is one input, a pointer to an array of 8-bit integers, and it returns an array of four vectors. The only thing I would have done differently is use a conformant array parameter for the input data so you know exactly how many elements it needs, but that won't work in C++ or MSVC (possibly until a few months ago, when they added C99/C11 support) anyway.
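To make that concrete, here is a small sketch of how I'd wrap it. deinterleave4 is a hypothetical name, the [static 32] parameter is the C99 way of stating the minimum element count (so this is C only, not C++), and the scalar branch just lets it build without NEON:

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* deinterleave4 is a hypothetical wrapper, not part of arm_neon.h.
 * "static 32" documents that the caller must supply at least 32 elements. */
void deinterleave4(const int8_t data[static 32], int8_t planes[4][8])
{
    assert(planes);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One input pointer in, four vectors out, all returned by value. */
    int8x8x4_t v = vld4_s8(data);
    for (int c = 0; c < 4; c++)
        vst1_s8(planes[c], v.val[c]);  /* val[c] holds data[c], data[c+4], ... */
#else
    /* Scalar fallback with the same layout, for non-NEON builds. */
    for (int c = 0; c < 4; c++)
        for (int i = 0; i < 8; i++)
            planes[c][i] = data[4 * i + c];
#endif
}
```

Nothing in the call site leaves room for the "is this an input or an output?" question that the pointer-based signature raises.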

"would I expect any performance difference to vld1q_f64_x4-ing a float64x2x4_t and using the same functions on elements of the array"

Possibly, but not from the multiply. The advantage comes from using a single instruction to load all four vectors; see https://godbolt.org/z/aY8a95 for example. In this case, you have one load and four multiplies instead of four loads and four multiplies.

In practice, I suspect that most good compilers would be able to optimize this to the same code (though I would want to check), but intrinsics typically map 1:1 to instructions, so you don't have to trust the compiler to be smart. For the vld4_s8 example above, there is an ld4 instruction, so there is a vld4_* family of functions, and these types are used in those APIs.
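For reference, the one-load, four-multiply pattern looks roughly like the sketch below. This is not the code behind the godbolt link; scale8 is a made-up name, the NEON branch assumes an AArch64 toolchain that provides the _x4 intrinsics, and the scalar branch is only there for other targets:

```c
#include <assert.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* scale8 is a hypothetical helper: out[i] = in[i] * k for 8 doubles. */
void scale8(const double in[8], double k, double out[8])
{
    assert(in && out);
#if defined(__aarch64__)
    /* One load intrinsic fills all four float64x2_t vectors. */
    float64x2x4_t v = vld1q_f64_x4(in);
    /* The arithmetic still operates on one vector at a time. */
    for (int i = 0; i < 4; i++)
        v.val[i] = vmulq_n_f64(v.val[i], k);
    /* One store intrinsic writes all four vectors back. */
    vst1q_f64_x4(out, v);
#else
    /* Scalar fallback for targets without AArch64 NEON. */
    for (int i = 0; i < 8; i++)
        out[i] = in[i] * k;
#endif
}
```

Whether this beats four separate vld1q_f64 calls depends on the compiler and the core, so compare the generated assembly before relying on it.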

