I have this code that blits a bitmap onto the frame buffer with SSE2 intrinsics:
for (uint r = 0; r < height; r++)
{
    uint32* bufPixels = (frameBuffer->pixels + xPos) + frameBuffer->pitch * (r + yPos);
    uint32* bmpPixels = bitmap->pixels + bitmap->pitch * r;
    for (uint p = 0; p + NUM_SIMD_PIXELS /* 4 */ <= width; p += NUM_SIMD_PIXELS)
    {
        __m128i bmpVec = _mm_loadu_si128((__m128i*)(bmpPixels + p));
        _mm_storeu_si128((__m128i*)(bufPixels + p), bmpVec);
    }
}
I want to add scaling to this, but I can’t come up with an approach that doesn’t fall back to scalar operations.
With SSE2 I am always working with 4 pixels at once, so I can’t just loop over width * xScale and do p / xScale on each pixel to get the actual pixel index in the bitmap’s pixel buffer.
Not really looking for code examples, more just ideas.
EDIT: here is a scalar example of the scaling I wish to accomplish:
for (uint r = 0; r < height * yScale; ++r)
{
    uint bmpRowIndex = (uint)(r / yScale);
    uint32* bufPixels = (frameBuffer->pixels + xPos) + frameBuffer->pitch * (r + yPos);
    uint32* bmpPixels = bitmap->pixels + bitmap->pitch * bmpRowIndex;
    for (uint p = 0; p < width * xScale; ++p)
    {
        uint bmpPixelIndex = (uint)(p / xScale);
        bufPixels[p] = bmpPixels[bmpPixelIndex];
    }
}
It’s basic nearest-neighbor sprite scaling, nothing fancy. I just want this but in SIMD.
Based on the comments, it seems you are only interested in nearest-neighbor resampling, and only for the RGBA8 pixel format. I think for your use case, SIMD is borderline useless. While it is possible to do something smart with _mm_shuffle_epi8 (which requires SSSE3, not just SSE2) or, if you have AVX, with _mm_permutevar_ps, I’m not sure you will gain much from these, if anything.
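To illustrate the kind of shuffle that does stay entirely in SIMD, here is a minimal sketch for the special case of exactly 2x horizontal scaling, using only SSE2 unpack instructions. The helper name `doubleRowSse2` is hypothetical (it is not from the question), and it assumes the destination row has room for `2 * width` pixels. It does not generalize to arbitrary scaling factors.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: writes every source pixel twice,
// so `dst` receives 2 * width pixels. Assumes `dst` is large enough.
inline void doubleRowSse2( uint32_t* dst, const uint32_t* src, size_t width )
{
    assert( dst != nullptr && src != nullptr );
    size_t p = 0;
    // Vectorized part: 4 source pixels in, 8 destination pixels out
    for( ; p + 4 <= width; p += 4 )
    {
        const __m128i v = _mm_loadu_si128( (const __m128i*)( src + p ) );  // a b c d
        const __m128i lo = _mm_unpacklo_epi32( v, v );  // a a b b
        const __m128i hi = _mm_unpackhi_epi32( v, v );  // c c d d
        _mm_storeu_si128( (__m128i*)( dst + 2 * p ), lo );
        _mm_storeu_si128( (__m128i*)( dst + 2 * p + 4 ), hi );
    }
    // Scalar remainder for the last 0..3 source pixels
    for( ; p < width; p++ )
    {
        dst[ 2 * p ] = src[ p ];
        dst[ 2 * p + 1 ] = src[ p ];
    }
}
```

Vertical 2x scaling would then amount to producing each output row twice. Whether this beats a plain scalar loop in practice is something to measure, not assume.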
Assuming you compile for 64 bits, try the following version. The code is untested, but I hope the idea is clear. I would be surprised if SIMD could speed it up by any meaningful margin, unless the scaling multiplier is restricted to small rational numbers.
#include <stdint.h>
#include <stddef.h>
#include <assert.h>
#include <cmath>
#include <algorithm>

struct RgbaBitmap
{
    // Bitmap data in system memory
    uint32_t* pointer;
    // Distance between rows; expressed in uint32_t elements, not bytes
    size_t rowPitch;
    // Size of the bitmap
    int width, height;
};

// End value for the 32.32 fixed-point counter, which is incremented by `step` on each iteration
inline uint64_t scaledEnd32( uint64_t step, int rt, int sprite )
{
    if( rt > 0 && sprite > 0 )
    {
        // End value for the input sprite
        uint64_t sprite64 = (uint32_t)sprite;
        uint64_t endSprite = sprite64 << 32;
        // End value for the output bitmap
        uint64_t rt64 = (uint32_t)rt;
        uint64_t endRt = rt64 * step;
        // Return the minimum of them
        return std::min( endSprite, endRt );
    }
    else
        return 0;
}

void scaleBitmap( const RgbaBitmap& target, const RgbaBitmap& sprite,
    int xPos, int yPos, float scaling )
{
    // Supporting negative xPos / yPos is possible but rather tricky, not doing it in this example
    assert( xPos >= 0 );
    assert( yPos >= 0 );
    assert( scaling > 0.0 );

    // Compute the inverse of the scaling multiplier.
    // The code below maps in the opposite direction, output pixels -> sprite pixels.
    // Also convert into 32.32 fixed point, rounding for optimal precision.
    constexpr double p32 = (double)( (int64_t)1 << 32 );
    const uint64_t step = (uint64_t)std::llround( p32 / scaling );

    const size_t destPitch = target.rowPitch;
    const size_t sourcePitch = sprite.rowPitch;
    uint32_t* rdiLine = target.pointer + yPos * destPitch + xPos;
    const uint64_t fxEnd = scaledEnd32( step, target.width - xPos, sprite.width );
    const uint64_t fyEnd = scaledEnd32( step, target.height - yPos, sprite.height );

    // The outer loop is by output rows
    for( uint64_t fy = 0; fy < fyEnd; fy += step, rdiLine += destPitch )
    {
        // Sprite Y coordinate for the current output row
        const size_t sourceY = ( fy >> 32 );
        assert( (int64_t)sourceY < sprite.height );
        // Source pointer to read from the sprite
        const uint32_t* const rsi = sprite.pointer + sourcePitch * sourceY;
        // The inner loop is within a single row, each iteration makes an output pixel
        uint32_t* rdi = rdiLine;
        for( uint64_t fx = 0; fx < fxEnd; fx += step, rdi++ )
        {
            const size_t sourceX = ( fx >> 32 );
            assert( (int64_t)sourceX < sprite.width );
            *rdi = rsi[ sourceX ];
        }
    }
}
If you compile for a 32-bit CPU where 64-bit integer arithmetic is expensive, refactor the code into 16.16 fixed point, i.e. use a (double)( 1 << 16 ) multiplier and shift numbers by 16 bits when sampling from the source sprite.
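A rough sketch of what the 16.16 variant of the inner loop could look like is shown below. This is untested against real sprites; the helper name `scaleRow16` and its parameters are hypothetical, and the limits on sprite width and step are assumptions chosen so the 32-bit counter cannot wrap around. With only 16 fractional bits the accumulated rounding error is larger than in the 32.32 version, so on long rows it may occasionally pick a neighboring source pixel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: nearest-neighbor scaling of a single row in 16.16 fixed point.
// Writes at most destCapacity pixels and returns the number of pixels written.
inline size_t scaleRow16( uint32_t* dest, size_t destCapacity,
    const uint32_t* source, uint32_t sourceWidth, float scaling )
{
    assert( scaling > 0.0f );
    // Assumption: keep the end value below 2^31 so that fx + step never overflows
    assert( sourceWidth <= 0x7FFFu );

    // Inverse scaling multiplier converted to 16.16 fixed point
    const double stepD = (double)( 1 << 16 ) / scaling;
    assert( stepD >= 0.5 && stepD < 2147483647.0 );
    const uint32_t step = (uint32_t)std::lround( stepD );

    // End value of the counter: one past the last source pixel
    const uint32_t fxEnd = sourceWidth << 16;

    size_t n = 0;
    for( uint32_t fx = 0; fx < fxEnd && n < destCapacity; fx += step, n++ )
    {
        // The upper 16 bits are the integer source coordinate
        dest[ n ] = source[ fx >> 16 ];
    }
    return n;
}
```

The same counter scheme applies to rows: a second 16.16 counter for Y selects the source row before this routine is called for each output row.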