Let's say we have a neural network with multiple layers, say a simple MLP (Multi-Layer Perceptron) that goes GEMM1 -> Activation1 -> GEMM2 -> Activation2. Now, let's say we are doing inference and we are using int8 as the precision of the data and the weights.

A GEMM layer involves accumulation, and accumulation is generally done in 32 bits, so the output of GEMM1 has all elements in int32. Now, before we start Activation1, we will need to convert them from 32 bits to 8 bits. Or maybe we don't, and we do Activation1 in 32 bits instead. But at some point we need to come back to 8 bits, say before starting GEMM2.
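To make the setup concrete, here is a rough NumPy sketch of what I mean (the shapes and values are arbitrary assumptions on my part; only the dtypes matter):

```python
import numpy as np

rng = np.random.default_rng(0)

x  = rng.integers(-128, 128, size=(1, 16), dtype=np.int8)    # int8 activations
w1 = rng.integers(-128, 128, size=(16, 32), dtype=np.int8)   # int8 weights

# GEMM1: multiply the int8 operands but accumulate in int32 so the sum of
# many int8*int8 products does not overflow.
acc1 = x.astype(np.int32) @ w1.astype(np.int32)

print(acc1.dtype)               # int32
print(acc1.min(), acc1.max())   # values well outside the int8 range [-128, 127]
```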
My question is: how is the conversion from int32 to int8 done? Two things come to my mind: rounding and quantization. There are many rounding methods (simple, convergent, round-to-nearest, etc.), but this doesn't seem like rounding, because we are not losing just a little bit of precision; we are losing 24 bits. For quantization, we basically take the entire range of numbers in the int32 output matrix and map it to an 8-bit range. But we need to know the full output matrix before we can do this; we can't do it element by element.
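Here is a rough sketch of the per-tensor mapping I have in mind, just to make the arithmetic concrete; the symmetric [-127, 127] target, the function name, and computing the scale on the fly are my own assumptions (I imagine real frameworks calibrate the scale ahead of time rather than scanning the matrix at inference):

```python
import numpy as np

def requantize_to_int8(acc_int32):
    # Map the accumulator range [-max_abs, +max_abs] onto [-127, 127].
    max_abs = np.max(np.abs(acc_int32))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.rint(acc_int32 / scale)                  # round to nearest integer
    return np.clip(q, -127, 127).astype(np.int8), scale

acc1 = np.array([[37, -15000, 260000, -98]], dtype=np.int32)
q1, scale = requantize_to_int8(acc1)
print(q1)      # [[  0  -7 127   0]] -- small int8 codes
print(scale)   # the scale needed to map the codes back to the original range
```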
I use int in the text above, but I think fixed point is the same from a rounding/quantization perspective. Floating point is different, and it makes sense that people like BFloat16 (over IEEE half-precision/FP16) because it has the same range as FP32. So when converting the output of GEMM1 from IEEE single-precision (FP32) to BFloat16, it's easier: we change a number from, say, 2.46392 to 2.5. We just lost some precision, but the converted result is still close to the original number. With fixed point/int, it is confusing because it seems like we are changing a number from, say, 253 to 56, which is a different scale altogether.
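For comparison, here is a rough sketch of the FP32-to-BFloat16 case: BFloat16 keeps FP32's sign bit and 8 exponent bits and simply drops the low 16 mantissa bits, so the value stays on the same scale and only loses precision (I truncate here to keep the bit trick obvious; I believe real conversions usually round to nearest even):

```python
import numpy as np

def fp32_to_bf16_truncate(values):
    x = np.asarray(values, dtype=np.float32)
    bits = x.view(np.uint32)
    bits = bits & np.uint32(0xFFFF0000)   # zero out the low 16 mantissa bits
    return bits.view(np.float32)          # BF16 precision, stored in an FP32 container

print(fp32_to_bf16_truncate([2.46392]))   # ~[2.453125]: less precision, same scale
```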
I hope this is making sense. Please correct me if I am wrong somewhere.
question from:
https://stackoverflow.com/questions/65713031/how-do-we-reduce-the-precision-of-accumulated-values-in-neural-networks-when-usi