I need some help understanding fixed-point arithmetic and quantization. Can someone correct me if I’m wrong?
Does having a large number of fractional bits introduce quantization noise in the least significant bits? Also, what happens when the fractional bits do not provide enough resolution to represent a very small number such as 0.00106465? In this case, would we need about 22 fractional bits, so that the resolution of 1/2^22 ≈ 0.00000024 maps the value almost exactly to the original?
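To make the resolution question concrete, here is a minimal Python sketch that quantizes 0.00106465 with different numbers of fractional bits. It assumes round-to-nearest quantization and ignores the integer part and overflow; the helper name `quantize` is only illustrative, and a real fixed-point library may truncate instead of rounding.

```python
def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (assumed rounding mode)."""
    scale = 1 << frac_bits          # number of steps per unit
    return round(x * scale) / scale

x = 0.00106465
for frac_bits in (13, 16, 22):
    step = 2.0 ** -frac_bits        # resolution of one least significant bit
    q = quantize(x, frac_bits)
    # Print the bit count, the step size, the quantized value, and the error.
    print(frac_bits, step, q, abs(q - x))
```

With round-to-nearest, the absolute error is at most half a step, so more fractional bits should never make the representation of a single value worse in this model.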
However, if 13 fractional bits are enough to represent the most significant bits of the original number, should that be preferred over 16 or 18 fractional bits? I ask because when I use the <24,11> configuration (11 integer bits and 13 fractional bits), I gain a bit of accuracy, but when I use a higher-precision configuration like <32,16>, I lose accuracy. Then again, when I use the <32,10> configuration, I don’t lose any accuracy but gain a bit.
I calculate the accuracy relative to the results I get when using the float datatype. I was expecting the accuracy to become worse in the case of <24,11>, but it turns out I was wrong.
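For reference, a small Python sketch of the step size and representable range of the three configurations. It assumes that <W,I> means W total bits with I integer bits, that the format is signed two's complement, and that the sign bit is counted among the integer bits; those conventions are my assumption, and the function name `fixed_format` is hypothetical.

```python
def fixed_format(total_bits, int_bits):
    """Return (frac_bits, step, min_value, max_value) for a <total_bits, int_bits> format."""
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                   # smallest representable increment
    min_value = -(2.0 ** (int_bits - 1))       # assumes the sign bit is part of int_bits
    max_value = 2.0 ** (int_bits - 1) - step
    return frac_bits, step, min_value, max_value

for total_bits, int_bits in ((24, 11), (32, 16), (32, 10)):
    print((total_bits, int_bits), fixed_format(total_bits, int_bits))
```

Under these assumptions <32,10> has the finest step (22 fractional bits) and the smallest range, while <32,16> has the widest range.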