I have 4 __m256i, where each group of 4 bytes contains 2 bytes that I care about, followed by 2 garbage bytes.
__m256i r0; // contains bytes a0 b0 xx xx a1 b1 xx xx a2 b2 xx xx a3 b3 xx xx
__m256i r1; // likewise, but 4..7 instead of 0..3
__m256i r2; // likewise, but 8..11
__m256i r3; // likewise, but 12..15
I want to produce 2 output __m256i, which are simply
__m256i aa; // contains bytes a0 a1 a2 ... a15
__m256i bb; // contains bytes b0 b1 b2 ... b15
How can I best do this with AVX2? A couple of rough approaches I’m considering are
- repeated application of _mm256_unpack{lo,hi}_epi{16,32,64}, to combine the non-garbage and yes-garbage into larger and larger contiguous chunks, eventually ending up with all the actual data in the low 128 bits
- or, starting off with _mm256_shuffle_epi8 to immediately group the actual data in the low bits (per lane).
The shuffle seems to get rid of the garbage more quickly, which is probably good. However, it is both more general (and more expensive?) than I really need, since my data is in a regular pattern and I don’t need arbitrary reindexing, and also too restrictive, because it only moves bytes within each 128-bit lane.
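For concreteness, here is a rough sketch of what I mean by the shuffle-first approach. It is only an illustration under some assumptions of mine: the function name deinterleave_ab_avx2 and the pointer-based interface are made up, the four registers are assumed to sit back to back in memory, and each register is assumed to hold 8 groups of (a, b, garbage, garbage), so the outputs come out as 32 a bytes and 32 b bytes. The target attribute is GCC/Clang specific.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper. Assumed layout: src holds r0..r3 back to back
   (128 bytes), i.e. 32 groups of (a, b, garbage, garbage).
   Writes the 32 a bytes to aa and the 32 b bytes to bb, in order. */
__attribute__((target("avx2")))
void deinterleave_ab_avx2(const uint8_t *src, uint8_t *aa, uint8_t *bb)
{
    assert(src && aa && bb);

    /* Per 128-bit lane: a bytes into dword 0, b bytes into dword 1,
       the upper 8 bytes zeroed (index -1 has the high bit set). */
    const __m256i gather = _mm256_setr_epi8(
        0, 4, 8, 12, 1, 5, 9, 13, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 4, 8, 12, 1, 5, 9, 13, -1, -1, -1, -1, -1, -1, -1, -1);

    __m256i s0 = _mm256_shuffle_epi8(
        _mm256_loadu_si256((const __m256i *)(src + 0)), gather);
    __m256i s1 = _mm256_shuffle_epi8(
        _mm256_loadu_si256((const __m256i *)(src + 32)), gather);
    __m256i s2 = _mm256_shuffle_epi8(
        _mm256_loadu_si256((const __m256i *)(src + 64)), gather);
    __m256i s3 = _mm256_shuffle_epi8(
        _mm256_loadu_si256((const __m256i *)(src + 96)), gather);

    /* Per lane: [a(s0) a(s1) b(s0) b(s1)] and likewise for s2, s3. */
    __m256i t01 = _mm256_unpacklo_epi32(s0, s1);
    __m256i t23 = _mm256_unpacklo_epi32(s2, s3);

    /* Low qwords carry the a dwords, high qwords carry the b dwords. */
    __m256i a_mixed = _mm256_unpacklo_epi64(t01, t23);
    __m256i b_mixed = _mm256_unpackhi_epi64(t01, t23);

    /* The in-lane steps leave the dwords in order 0,2,4,6 | 1,3,5,7,
       so one lane-crossing dword permute restores 0..7. */
    const __m256i fix = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    _mm256_storeu_si256((__m256i *)aa,
                        _mm256_permutevar8x32_epi32(a_mixed, fix));
    _mm256_storeu_si256((__m256i *)bb,
                        _mm256_permutevar8x32_epi32(b_mixed, fix));
}
```

That comes to 4 byte shuffles, 4 unpacks and 2 lane-crossing permutes for two output vectors, and I don’t know whether that is anywhere near optimal.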
Note: I’m also separately interested in whether there’s a better way with AVX512. In particular, it seems like _mm256_cvtepi32_epi16 might be a great way to filter out the garbage bytes? And does anything change when scaling up to __m512i (producing a0..a31 and b0..b31)?