XorShift is indeed a good choice. It is so good, so fast, and needs so little state that I'm surprised it hasn't been adopted more widely. It should be the standard generator on all platforms. I implemented it myself 8 years ago, and even then it could generate 800 MB/s of random bytes.
You cannot use vector instructions to speed up generating a single random number. There is too little instruction-level parallelism in those few instructions: each shift-and-xor step depends on the result of the previous one.
But you can easily speed up generating N numbers, where N is the number of lanes in a vector register of your target instruction set. Just run N generators in parallel: keep state for N generators and generate N numbers at the same time.
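As a rough illustration of running N generators side by side, here is a minimal sketch in C. It assumes the 32-bit xorshift variant with shifts 13, 17, 5 and N = 8 lanes (for example, one 256-bit register); the names `XS_LANES`, `xs_lanes`, `xs_lanes_seed` and `xs_lanes_next` are made up for this example, and the LCG used for seeding is just one simple way to give each lane a different nonzero start. The loop is plain C with no dependencies between lanes, so a vectorizing compiler has a chance to turn it into vector instructions; hand-written intrinsics would be another option.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed lane count: 8 x 32-bit values, e.g. one 256-bit vector register. */
#define XS_LANES 8

/* State for XS_LANES independent xorshift32 generators (hypothetical layout). */
typedef struct {
    uint32_t s[XS_LANES]; /* every entry must be nonzero */
} xs_lanes;

/* Give each lane a different nonzero state.
   The LCG step is an assumption of this sketch, used only for seeding. */
static void xs_lanes_seed(xs_lanes *g, uint32_t seed)
{
    for (size_t i = 0; i < XS_LANES; i++) {
        seed = seed * 1664525u + 1013904223u;
        g->s[i] = seed ? seed : 0x9E3779B9u; /* xorshift must never start at 0 */
    }
}

/* Advance every lane by one xorshift32 step and write XS_LANES outputs.
   Lanes do not depend on each other, so the loop body maps directly
   onto element-wise vector shifts and xors. */
static void xs_lanes_next(xs_lanes *g, uint32_t out[XS_LANES])
{
    for (size_t i = 0; i < XS_LANES; i++) {
        uint32_t x = g->s[i];
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        g->s[i] = x;
        out[i] = x;
    }
}
```

Each call to `xs_lanes_next` yields N numbers for roughly the cost of one scalar step, provided the loop really is vectorized; it is worth checking the generated assembly for your compiler and flags.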
If client code demands numbers one at a time, you could keep a buffer of N (or more) numbers. When the buffer is empty, you refill it using vector instructions; otherwise you just return the next number.
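The buffering idea could look something like the sketch below. It is self-contained and again assumes xorshift32 (shifts 13, 17, 5) with 8 lanes; `XS_BUF_LANES`, `xs_buffered`, `xs_buffered_init` and `xs_buffered_next` are hypothetical names, and the LCG seeding is only an assumption to get distinct nonzero states.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed lane count; the buffer holds exactly one batch of outputs. */
#define XS_BUF_LANES 8

typedef struct {
    uint32_t state[XS_BUF_LANES]; /* one nonzero xorshift32 state per lane */
    uint32_t buf[XS_BUF_LANES];   /* numbers produced by the last refill */
    size_t   pos;                 /* next unread index; XS_BUF_LANES means empty */
} xs_buffered;

/* Seed all lanes (LCG step is an assumption of this sketch) and mark the
   buffer as empty so the first request triggers a refill. */
static void xs_buffered_init(xs_buffered *g, uint32_t seed)
{
    for (size_t i = 0; i < XS_BUF_LANES; i++) {
        seed = seed * 1664525u + 1013904223u;
        g->state[i] = seed ? seed : 0x9E3779B9u;
    }
    g->pos = XS_BUF_LANES;
}

/* Return one number. Only when the buffer is exhausted do we step all
   lanes at once; that refill loop is the part a compiler can vectorize. */
static uint32_t xs_buffered_next(xs_buffered *g)
{
    if (g->pos == XS_BUF_LANES) {
        for (size_t i = 0; i < XS_BUF_LANES; i++) {
            uint32_t x = g->state[i];
            x ^= x << 13;
            x ^= x >> 17;
            x ^= x << 5;
            g->state[i] = x;
            g->buf[i] = x;
        }
        g->pos = 0;
    }
    return g->buf[g->pos++];
}
```

Client code calls `xs_buffered_init` once and then `xs_buffered_next` whenever it needs a number. The common path is a compare, a load and an increment; the refill cost is spread over N calls.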
Whoever fights dragons for too long becomes a dragon himself; gaze into the abyss for too long, and the abyss will gaze back…