I am trying to understand Metal computing, but I get completely unexpected results, even with the most simple code that I can imagine. Here is my setup:
I have the following unit test function:
func test_test() throws {
let metalImageProcessor = MetalImageProcessor()!
metalImageProcessor.test()
}
The idea is that after the test completed, I can check the logs in the debugger.
My MetalImageProcessor
is the following:
func test() {
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let testFunction = library.makeFunction(name: "testFunction")!
let pipelineState = try! device.makeComputePipelineState(function: testFunction)
let commandBuffer = commandQueue.makeCommandBuffer()!
let commandEncoder = commandBuffer.makeComputeCommandEncoder()!
let threadExecutionWidth = makeTexturePipelineState.threadExecutionWidth
let threadExecutionHeight = makeTexturePipelineState.maxTotalThreadsPerThreadgroup / threadExecutionWidth
let threadsPerThreadgroup = MTLSize(width: threadExecutionWidth, height: threadExecutionHeight, depth: 1)
let threadsPerGrid = MTLSize(width: threadExecutionWidth, height: threadExecutionHeight, depth: 1)
let outputBufferSize = threadsPerGrid.width * threadsPerGrid.height * MemoryLayout<Float>.size // shader writes 10 floats
var buffer1 = [UInt8](repeating: 0, count: outputBufferSize)
let outputBuffer1 = device.makeBuffer(bytes: &buffer1, length: outputBufferSize, options: .storageModeShared)
commandEncoder.setComputePipelineState(pipelineState)
commandEncoder.setBuffer(outputBuffer1, offset: 0, index: 0)
commandEncoder.dispatchThreadgroups(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
debugBufferValues(name: "outputBuffer1", buffer: outputBuffer1!, width: threadExecutionWidth, height: threadExecutionHeight)
}
func debugBufferValues(name: String, buffer: MTLBuffer, width: Int, height: Int) {
let debugData = buffer.contents().bindMemory(to: Float.self, capacity: width * height)
print("n")
for i in 0 ..< min(width * height, 8320) {
print(name, "[(i)]: (debugData[i])")
}
}
It calls the following shader kernel function:
kernel void testFunction(device float *outputBuffer [[buffer(0)]],
uint2 gridSize [[grid_size]],
uint2 threadGroupIDinGrid [[threadgroup_position_in_grid]],
uint2 threadIDinThreadGroup [[thread_position_in_threadgroup]],
uint2 threadGroupSize [[threads_per_threadgroup]],
uint2 gid [[thread_position_in_grid]]) {
uint nrFloatsPerThread = 10; // see below
uint testValueIndex = 0; // start value
uint outputBufferIndex = (gridSize.x * gid.y + gid.x) * nrFloatsPerThread + testValueIndex;
outputBuffer[outputBufferIndex] = gridSize.x; testValueIndex += 1;
outputBuffer[outputBufferIndex] = gridSize.y; testValueIndex += 1;
outputBuffer[outputBufferIndex] = threadGroupIDinGrid.x; testValueIndex += 1;
outputBuffer[outputBufferIndex] = threadGroupIDinGrid.y; testValueIndex += 1;
outputBuffer[outputBufferIndex] = threadIDinThreadGroup.x; testValueIndex += 1;
outputBuffer[outputBufferIndex] = threadIDinThreadGroup.y; testValueIndex += 1;
outputBuffer[outputBufferIndex] = threadGroupSize.x; testValueIndex += 1;
outputBuffer[outputBufferIndex] = threadGroupSize.y; testValueIndex += 1;
outputBuffer[outputBufferIndex] = gid.x; testValueIndex += 1;
outputBuffer[outputBufferIndex] = gid.y; testValueIndex += 1;
}
What I expected:
Since the grid has the same dimensions as the thread group, there is only a single thread group, in my case of size (32,16,1).
Every thread in this group executes the shader function, and writes 10 floats in the output buffer with the following expected values:
outputBuffer1 [0]: 32.0 // gridSize.x
outputBuffer1 [1]: 16.0 // gridSize.y
outputBuffer1 [2]: 0.0 // threadGroupIDinGrid.x
outputBuffer1 [3]: 0.0 // threadGroupIDinGrid.y
outputBuffer1 [4]: 0.0 // threadIDinThreadGroup.x
outputBuffer1 [5]: 0.0 // threadIDinThreadGroup.y
outputBuffer1 [6]: 32.0 // threadGroupSize.x
outputBuffer1 [7]: 16.0 // threadGroupSize.y
outputBuffer1 [8]: 0.0 // gid.x
outputBuffer1 [9]: 0.0 // gid.y
outputBuffer1 [0]: 32.0 // gridSize.x
outputBuffer1 [1]: 16.0 // gridSize.y
outputBuffer1 [2]: 0.0 // threadGroupIDinGrid.x
outputBuffer1 [3]: 0.0 // threadGroupIDinGrid.y
outputBuffer1 [4]: 1.0 // threadIDinThreadGroup.x
outputBuffer1 [5]: 0.0 // threadIDinThreadGroup.y
outputBuffer1 [6]: 32.0 // threadGroupSize.x
outputBuffer1 [7]: 16.0 // threadGroupSize.y
outputBuffer1 [8]: 1.0 // gid.x
outputBuffer1 [9]: 0.0 // gid.y
and so forth.
What I got:
outputBuffer1 [0]: 250.0
outputBuffer1 [1]: 0.0
outputBuffer1 [2]: 0.0
outputBuffer1 [3]: 0.0
outputBuffer1 [4]: 0.0
outputBuffer1 [5]: 0.0
outputBuffer1 [6]: 0.0
outputBuffer1 [7]: 0.0
outputBuffer1 [8]: 0.0
outputBuffer1 [9]: 0.0
outputBuffer1 [10]: 250.0
outputBuffer1 [11]: 0.0
outputBuffer1 [12]: 0.0
outputBuffer1 [13]: 0.0
outputBuffer1 [14]: 0.0
outputBuffer1 [15]: 0.0
outputBuffer1 [16]: 0.0
outputBuffer1 [17]: 0.0
outputBuffer1 [18]: 0.0
outputBuffer1 [19]: 0.0
and so forth.
Here already the first 10 values in the output buffer are wrong. I neither understand the 250.0 in outputBuffer1 [0]
, nor the following zeroes.
So something is basically wrong with my code, but I don’t find the bug.
Any help is welcome!