If your program doesn't depend on the full precision of floating-point operations, the performance-accuracy trade-off will probably be worth it, since it can increase FLOPS throughput 4-5x. However, in some ...
A Python tool that measures whether sin(x) and cos(x) are implemented as standard (≤2 ULP guaranteed) or as approximate “fast-math” versions, and compares the results across multiple GPU backends (Vulkan, D3D12, ...
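To make the ≤2 ULP criterion concrete, here is a minimal CPU-only sketch of how an error in ULPs might be measured for a float32 sin(x) against a float64 reference. It is not the tool's actual implementation and does not touch any GPU backend. The helper names (`to_f32`, `ulp_f32`, `ulp_error`, `max_ulp_error`) and the `fast_sin` stand-in (a truncated Taylor polynomial) are hypothetical and chosen only for illustration.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ulp_f32(x: float) -> float:
    """Gap between |x| (rounded to float32) and the next float32 above it."""
    x = abs(to_f32(x))
    if x == 0.0:
        return 2.0 ** -149  # smallest positive float32 subnormal
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    next_up = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return next_up - x


def ulp_error(approx: float, reference: float) -> float:
    """Absolute error of approx, expressed in float32 ULPs at the reference."""
    return abs(approx - reference) / ulp_f32(reference)


def fast_sin(x: float) -> float:
    """Hypothetical stand-in for a fast-math sin: degree-5 Taylor polynomial."""
    return to_f32(x - x**3 / 6.0 + x**5 / 120.0)


def max_ulp_error(fn, samples) -> float:
    """Worst-case ULP error of fn over samples, using math.sin as reference."""
    return max(ulp_error(fn(x), math.sin(x)) for x in samples)


# Sample (0, pi/2] at float32-representable inputs.
samples = [to_f32(i * (math.pi / 2) / 1000) for i in range(1, 1001)]

# "Standard" here means the float64 result rounded once to float32.
standard = max_ulp_error(lambda x: to_f32(math.sin(x)), samples)
fast = max_ulp_error(fast_sin, samples)

print(f"standard sin: max error {standard:.3f} ULP")
print(f"fast_sin:     max error {fast:.1f} ULP")
print("fast_sin within 2 ULP:", fast <= 2.0)
```

Under these assumptions, an implementation would be classified as "standard" when its worst-case error stays within 2 ULP over the sampled range, and as "fast-math" otherwise. A real measurement would replace `fast_sin` with values read back from a shader running on each backend.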