I am learning about ARMv8 AArch64 SIMD instructions, hoping to optimize some calculations. In this case, I am looking for a modulo operation on a 4xf32 vector (`float32x4_t`).
How can I implement a modulo with the NEON instruction set?
Note: what I actually need is to keep my angle values between -PI and +PI, so I am also interested in other ways of achieving that.
Note: currently I am trying to do this in C with the arm_neon.h header, but at some point I might write it directly in assembly so that I can combine instructions without storing intermediate results in variables.