AI/ML training traditionally has been performed using floating point data formats, primarily because that is what was available. But this usually isn’t a viable option for inference on the edge, where ...
Multiplication on a common microcontroller is easy. But division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can run between 9 and 15 cycles.
An unfortunate reality of trying to represent continuous real numbers in a fixed space (e.g. with a limited number of bits) is that this comes with an inevitable loss of both precision and accuracy.