I'm slowly learning SIMD, but there are still some aspects that I cannot wrap my head around when trying to come up with SIMD solutions to a problem. One of them is the case where each input element is smaller than the corresponding output element.
As an example, let's say I have an 8-bit grayscale image, i.e. each pixel is a byte in the range 0-255, and I want to convert it to a pre-multiplied alpha image with a specified colour. So the input is an 8-bit array (8 bits per pixel), but the output is a 32-bit array (32 bits per pixel, RGBA_8888).
So the arrays are not one-to-one: one byte in the grayscale array is converted to 4 bytes in the colour array.
In scalar form, this would look like this:
using BenchmarkDotNet.Attributes;

public class Test
{
    const int ImageSize = 2048;
    const int ImageLength = ImageSize * ImageSize;

    private byte[] _bytesGray = new byte[ImageLength];
    private uint[] _pixelsRGBA = new uint[ImageLength];

    private const byte _colorR = 0xFF;
    private const byte _colorG = 0x01;
    private const byte _colorB = 0x02;
    private const byte _colorA = 0xFF;

    [GlobalSetup]
    public void Setup()
    {
        for (int i = 0; i < ImageLength; i++)
        {
            _bytesGray[i] = (byte)(i + 1);
            _pixelsRGBA[i] = 0;
        }
    }

    [Benchmark]
    public unsafe void GrayscaleToColor_Scalar()
    {
        fixed (byte* bytePtr = _bytesGray)
        fixed (uint* pixelPtr = _pixelsRGBA)
        {
            for (int i = 0; i < ImageLength; ++i)
            {
                // Scale each colour channel by the gray value.
                byte value = bytePtr[i];
                byte r = (byte)((value * _colorR) >> 8);
                byte g = (byte)((value * _colorG) >> 8);
                byte b = (byte)((value * _colorB) >> 8);
                byte a = (byte)((value * _colorA) >> 8);
                // Combine the channels with OR; AND would always produce 0 here.
                pixelPtr[i] = (uint)(r << 24 | g << 16 | b << 8 | a);
            }
        }
    }
}
To process this in SIMD form, my thinking is that I want to step through the _bytesGray array 32 bytes at a time, so that each load takes full advantage of a 256-bit vector register.
fixed (byte* valueBytes = _bytesGray)
{
    for (int i = 0; i < ImageLength; i += 32)
    {
        ...
    }
}
But as the input pixel is a byte and the output pixel is a uint, I think I would then need four vectors, where each vector holds 8 of the 32 bytes and each byte is duplicated 4 times to fill its uint. At that point I could then do my multiply.
fixed (byte* valueBytes = _bytesGray)
{
    for (int i = 0; i < ImageLength; i += 32)
    {
        Vector256<byte> bytes = Avx2.LoadVector256(valueBytes + i);
        Vector256<uint> _0_8_grayBytes = ... // get 0-8 bytes and splat each byte to fill uint
        Vector256<uint> _8_16_grayBytes = ... // get 8-16 bytes and splat each byte to fill uint
        Vector256<uint> _16_24_grayBytes = ... // get 16-24 bytes and splat each byte to fill uint
        Vector256<uint> _24_32_grayBytes = ... // get 24-32 bytes and splat each byte to fill uint
    }
}
But I can't seem to figure out how to express this as actual instructions, or even whether it's the right approach.
How would you go about doing this?
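For reference, here is one way the "splat each byte to fill a uint, then multiply" idea might be expressed. It is a minimal sketch under stated assumptions, not a tuned or definitive answer. It assumes .NET Core 3.0 or later (System.Runtime.Intrinsics), an AVX2-capable CPU, unsafe code enabled, and an output array at least as long as the input. The names GrayToColorSketch, ScalarPixel, Convert8 and Convert are made up for illustration. It handles 8 pixels per step instead of 32, so a 32-byte loop like the one above would call the 8-pixel step four times.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class GrayToColorSketch
{
    // Colour constants copied from the scalar version in the question.
    const byte ColorR = 0xFF, ColorG = 0x01, ColorB = 0x02, ColorA = 0xFF;

    // Scalar reference for a single pixel: r << 24 | g << 16 | b << 8 | a.
    public static uint ScalarPixel(byte value)
    {
        uint r = (uint)(value * ColorR) >> 8;
        uint g = (uint)(value * ColorG) >> 8;
        uint b = (uint)(value * ColorB) >> 8;
        uint a = (uint)(value * ColorA) >> 8;
        return r << 24 | g << 16 | b << 8 | a;
    }

    // Converts 8 gray bytes at src into 8 RGBA uints at dst. Caller must check Avx2.IsSupported.
    public static unsafe void Convert8(byte* src, uint* dst)
    {
        // Zero-extend 8 bytes into eight 32-bit lanes: each lane is 0x000000vv.
        Vector256<int> widened = Avx2.ConvertToVector256Int32(src);

        // Multiplying by 0x01010101 repeats the byte in all four positions: 0xvvvvvvvv.
        Vector256<byte> splat =
            Avx2.MultiplyLow(widened, Vector256.Create(0x01010101)).AsByte();

        // Widen the bytes to 16 bits so that value * colour (at most 255 * 255) fits.
        Vector256<byte> zero = Vector256<byte>.Zero;
        Vector256<ushort> lo = Avx2.UnpackLow(splat, zero).AsUInt16();
        Vector256<ushort> hi = Avx2.UnpackHigh(splat, zero).AsUInt16();

        // Per-channel multipliers in memory order A, B, G, R (little-endian uint),
        // broadcast as one 64-bit pattern so it repeats for every pixel.
        Vector256<ushort> color = Vector256.Create(
            (ulong)ColorA | (ulong)ColorB << 16 |
            (ulong)ColorG << 32 | (ulong)ColorR << 48).AsUInt16();

        // (value * colour) >> 8 for every channel of every pixel.
        lo = Avx2.ShiftRightLogical(Avx2.MultiplyLow(lo, color), 8);
        hi = Avx2.ShiftRightLogical(Avx2.MultiplyLow(hi, color), 8);

        // Narrow back to bytes. Unpack and pack both work per 128-bit lane,
        // so the original pixel order is restored. Results never exceed 254,
        // so the saturation does not change any value.
        Vector256<byte> packed = Avx2.PackUnsignedSaturate(lo.AsInt16(), hi.AsInt16());
        Avx.Store((byte*)dst, packed);
    }

    // Converts a whole array, using the scalar path for the tail or when AVX2 is missing.
    public static unsafe void Convert(byte[] gray, uint[] rgba)
    {
        fixed (byte* src = gray)
        fixed (uint* dst = rgba)
        {
            int i = 0;
            if (Avx2.IsSupported)
            {
                for (; i + 8 <= gray.Length; i += 8)
                    Convert8(src + i, dst + i);
            }
            for (; i < gray.Length; i++)
                dst[i] = ScalarPixel(src[i]);
        }
    }
}
```

The multiply by 0x01010101 is only one way to do the splat. A byte shuffle (Avx2.Shuffle) with a suitable index mask is another option, but it shuffles within each 128-bit lane, so the source bytes would first have to be present in both lanes.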