The answer depends on the ARM CPU. The Cortex-A8, for example, implements the NEON and VFP instructions in a coprocessor that is connected to the ARM core via a FIFO. When the instruction decoder detects a NEON or VFP instruction, it simply places it into the FIFO. The NEON coprocessor fetches instructions from the FIFO and executes them. The NEON/VFP coprocessor thus lags behind a bit - on the Cortex-A8 up to 20 cycles or so.
Usually, you don't need to care about that delay, unless you attempt to transfer data back from the NEON/VFP coprocessor to the main ARM core. (It doesn't matter much whether you do that by moving from a NEON/VFP register into an ARM register, or by using ARM instructions to read memory that has recently been written to by NEON instructions.) In that case, the main ARM core is stalled until the NEON core has emptied the FIFO, i.e. up to 20 cycles or so.
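To make the point about transfers concrete, here is a minimal sketch that keeps a running sum in NEON registers for the whole loop and moves the result to an ARM register only once, at the very end. The function name `sum_u32` and the requirement that `n` is a multiple of 4 are assumptions made for this illustration. The plain C branch is only there so the sample also builds on non-ARM machines, and the compiler is of course free to schedule the intrinsics differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum n 32-bit values; n must be a multiple of 4. */
uint32_t sum_u32(const uint32_t *x, size_t n)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t acc = vdupq_n_u32(0);
    for (size_t i = 0; i < n; i += 4) {
        /* The partial sums never leave the NEON register file here. */
        acc = vaddq_u32(acc, vld1q_u32(x + i));
    }
    /* Reduce the four lanes to one, still on the NEON side. */
    uint32x2_t half = vadd_u32(vget_low_u32(acc), vget_high_u32(acc));
    half = vpadd_u32(half, half);
    /* The only NEON-to-ARM transfer: one lane read after the loop. */
    return vget_lane_u32(half, 0);
#else
    /* Portable fallback so the sketch compiles without NEON. */
    uint32_t total = 0;
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
#endif
}
```

Reading a lane back inside the loop instead (for example to test the partial sum in ARM code) would, on a Cortex-A8, risk paying the FIFO-drain stall on every iteration.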
The ARM core can usually enqueue NEON/VFP instructions faster than the NEON/VFP coprocessor can execute them. You can exploit that to have both cores work in parallel by suitably interleaving your instructions. E.g., insert one ARM instruction after every block of two or three NEON instructions. Or maybe two ARM instructions if you also want to exploit ARM's dual-issue capability. You will have to use inline assembly to do this - if you use intrinsics, the exact scheduling of the instructions is up to the compiler, and whether it has the smarts to interleave them suitably is anybody's guess.
Your code will look something like
<neon instruction>
<neon instruction>
<neon instruction>
<arm instruction>
<arm instruction>
<neon instruction>
...
I don't have a code sample at hand, but if you're somewhat familiar with ARM assembly, interleaving the instructions shouldn't be much of a challenge. After you're done, be sure to use an instruction-level profiler to check that things actually work as intended. You should see virtually no time spent on the ARM instructions.
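For what it's worth, a rough sketch of such interleaving with GCC-style inline assembly could look like the following. It adds two float arrays, and slots the ARM-side work (loop counter, prefetches, branch) in between small groups of NEON instructions. The function name `add_f32_interleaved`, the multiple-of-8 length requirement, and the prefetch distance of 64 bytes are all assumptions made for illustration, and the ordering has not been tuned or profiled on real hardware. The plain C branch exists only so the sample builds on non-ARM machines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: dst[i] = a[i] + b[i]; n must be a multiple of 8. */
void add_f32_interleaved(float *dst, const float *a, const float *b, size_t n)
{
    assert(n % 8 == 0);
    if (n == 0)
        return;
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    __asm__ volatile(
        "1:                                \n\t"
        "vld1.32   {d0-d3}, [%[a]]!        \n\t" /* NEON: load 8 floats from a   */
        "vld1.32   {d4-d7}, [%[b]]!        \n\t" /* NEON: load 8 floats from b   */
        "subs      %[n], %[n], #8          \n\t" /* ARM:  update loop counter    */
        "pld       [%[a], #64]             \n\t" /* ARM:  prefetch ahead in a    */
        "vadd.f32  q0, q0, q2              \n\t" /* NEON: first 4 sums           */
        "vadd.f32  q1, q1, q3              \n\t" /* NEON: next 4 sums            */
        "pld       [%[b], #64]             \n\t" /* ARM:  prefetch ahead in b    */
        "vst1.32   {d0-d3}, [%[dst]]!      \n\t" /* NEON: store 8 results        */
        "bne       1b                      \n\t" /* ARM:  loop on flags from subs */
        : [dst] "+r"(dst), [a] "+r"(a), [b] "+r"(b), [n] "+r"(n)
        :
        : "q0", "q1", "q2", "q3", "cc", "memory");
#else
    /* Portable fallback so the sketch compiles without 32-bit ARM NEON. */
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
#endif
}
```

Note that none of the ARM instructions here read a value produced by NEON, neither from a register nor from the freshly written `dst` memory, so the loop does not trigger the stall described above.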
Remember that other ARMv7 implementations might implement NEON completely differently. It seems, for example, that the Cortex-A9 has moved NEON closer to the ARM core, and has a much lower penalty on data movements from NEON/VFP back to ARM. Whether or not this affects parallel scheduling of instructions I do not know, but it's definitely something to watch out for.