To transpose a 5D array [X, Y, Z, W, Q] into [Z, X, W, Y, Q], i.e. permute the axis order from (0, 1, 2, 3, 4) to (2, 0, 3, 1, 4), we can decompose the process into two steps:
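- Transform [X, Y, Z, W, Q] to [X, Z, W, Y, Q]. Treating Z and W as one merged axis, this is the 4D permutation (0, 1, 2, 3) → (0, 2, 1, 3) applied to [X, Y, Z·W, Q].
- Transform [X, Z, W, Y, Q] to [Z, X, W, Y, Q]. Treating W, Y and Q as one merged axis, this is the 3D permutation (0, 1, 2) → (1, 0, 2) applied to [X, Z, W·Y·Q].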
I’m unsure whether this can be done in a single kernel that uses shared memory for coalesced access, since this case differs significantly from the classic 2D transpose tutorial at https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/.
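For checking any kernel against a known-good result, a CPU reference may help. The sketch below is plain host-side C++, not a CUDA kernel, and assumes a row-major layout with `int` elements. The names `permute`, `permute_direct` and `permute_two_step` are hypothetical helpers introduced here for illustration. It computes the direct 5D permutation and the two-step version with merged axes so the two can be compared.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference permute for a row-major N-D array (assumed layout).
// Output axis i takes its extent and index from input axis perm[i].
template <std::size_t N>
std::vector<int> permute(const std::vector<int>& src,
                         const std::array<std::size_t, N>& shape,
                         const std::array<std::size_t, N>& perm) {
    std::array<std::size_t, N> src_stride{};
    std::array<std::size_t, N> dst_shape{};
    std::size_t total = 1;
    for (std::size_t i = N; i-- > 0;) {  // row-major strides, last axis fastest
        src_stride[i] = total;
        total *= shape[i];
    }
    assert(src.size() == total);
    for (std::size_t i = 0; i < N; ++i) dst_shape[i] = shape[perm[i]];

    std::vector<int> dst(total);
    std::array<std::size_t, N> idx{};  // multi-index into dst
    for (std::size_t out = 0; out < total; ++out) {
        std::size_t in = 0;
        for (std::size_t i = 0; i < N; ++i) in += idx[i] * src_stride[perm[i]];
        dst[out] = src[in];
        // Advance the dst multi-index like an odometer, last axis fastest.
        for (std::size_t i = N; i-- > 0;) {
            if (++idx[i] < dst_shape[i]) break;
            idx[i] = 0;
        }
    }
    return dst;
}

// Direct permutation: [X, Y, Z, W, Q] -> [Z, X, W, Y, Q].
std::vector<int> permute_direct(const std::vector<int>& a, std::size_t X,
                                std::size_t Y, std::size_t Z, std::size_t W,
                                std::size_t Q) {
    return permute<5>(a, std::array<std::size_t, 5>{X, Y, Z, W, Q},
                      std::array<std::size_t, 5>{2, 0, 3, 1, 4});
}

// Two-step version using merged axes.
std::vector<int> permute_two_step(const std::vector<int>& a, std::size_t X,
                                  std::size_t Y, std::size_t Z, std::size_t W,
                                  std::size_t Q) {
    // Step 1: view as [X, Y, Z*W, Q], apply (0, 2, 1, 3) -> [X, Z*W, Y, Q].
    std::vector<int> t =
        permute<4>(a, std::array<std::size_t, 4>{X, Y, Z * W, Q},
                   std::array<std::size_t, 4>{0, 2, 1, 3});
    // Step 2: view as [X, Z, W*Y*Q], apply (1, 0, 2) -> [Z, X, W*Y*Q].
    return permute<3>(t, std::array<std::size_t, 3>{X, Z, W * Y * Q},
                      std::array<std::size_t, 3>{1, 0, 2});
}
```

One observation from the index arithmetic: Q is the innermost axis in both the input and the output, so consecutive `q` indices are contiguous in memory on both sides. If a CUDA port maps `threadIdx.x` to `q`, both the reads and the writes along `q` should be contiguous, which suggests a shared-memory tile may not be needed when Q is reasonably large. That is an assumption worth profiling rather than a guarantee.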