Usually, when writing a SIMD function over a large array whose length is not a multiple of the register width, you can process the bulk with SIMD and then handle the last few elements with scalar code.
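For reference, the usual pattern I mean looks roughly like the sketch below. The class name `BulkThenTail` and the element-wise add are made up purely for illustration, and it assumes 8-float AVX registers; it is not my actual code.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class BulkThenTail
{
    // Hypothetical example operation: dst[i] = a[i] + b[i].
    public static unsafe void Add(float[] a, float[] b, float[] dst)
    {
        int n = Math.Min(dst.Length, Math.Min(a.Length, b.Length));
        int i = 0;

        if (Avx.IsSupported)
        {
            fixed (float* pa = a, pb = b, pd = dst)
            {
                // Bulk: only full 8-float vectors, so nothing is accessed past index n - 1.
                for (; i <= n - 8; i += 8)
                {
                    Vector256<float> va = Avx.LoadVector256(pa + i);
                    Vector256<float> vb = Avx.LoadVector256(pb + i);
                    Avx.Store(pd + i, Avx.Add(va, vb));
                }
            }
        }

        // Tail: the remaining 0..7 elements (or all of them when AVX is unavailable).
        for (; i < n; i++)
        {
            dst[i] = a[i] + b[i];
        }
    }
}
```

This only works because the loop walks the array front to back, so the partial vector can only ever occur once, at the very end.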
However, the code I am currently writing is not a simple loop over an array from beginning to end. Instead, the memory reads/writes are somewhat random, so at any point in the loop a full-width load or store at the address I need might run past the end of the array.
From a little research, I have seen that I can use MaskLoad and MaskStore. Whilst this solves the problem of reading/writing out of bounds, it also kills the performance. MaskStore seems to increase the time taken by about 30%.
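To make the masked approach concrete, here is a minimal sketch of the kind of thing I am doing. The helper name `AddOne` and the "add 1.0f" operation are hypothetical, and building the mask with an integer compare assumes AVX2 is available; my real code differs in the details.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class MaskedAccess
{
    // Hypothetical operation: add 1.0f to up to 8 floats starting at data[start],
    // without touching memory past the end of the array.
    public static unsafe void AddOne(float[] data, int start)
    {
        if (!Avx2.IsSupported) throw new PlatformNotSupportedException("AVX2 required");
        if (start < 0 || start >= data.Length) return;

        int remaining = Math.Min(8, data.Length - start);

        // Lane k is enabled when k < remaining. MaskLoad/MaskStore test the most
        // significant bit of each mask lane; the compare yields all-ones or zero.
        Vector256<int> lanes = Vector256.Create(0, 1, 2, 3, 4, 5, 6, 7);
        Vector256<float> mask =
            Avx2.CompareGreaterThan(Vector256.Create(remaining), lanes).AsSingle();

        fixed (float* p = data)
        {
            Vector256<float> v = Avx.MaskLoad(p + start, mask); // disabled lanes read as 0
            v = Avx.Add(v, Vector256.Create(1.0f));
            Avx.MaskStore(p + start, mask, v);                  // disabled lanes are not written
        }
    }
}
```

With an 11-element array, `MaskedAccess.AddOne(data, 8)` would update only indices 8, 9 and 10, which is the behaviour I want; it is the cost of doing this on every iteration that hurts.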
I’m wondering if there is an alternative I can use?
I have read about BlendVariable, but I don’t think that helps, as you still have the issue of reading/writing past the bounds of the array.