#include <arm_neon.h>
#include <cstring>

// result[j] = sum over i of matrix[i][j] * vector[i]
// (each output element is the dot product of the input vector
// with one column of the matrix).
void gemv_columnwise_stride_neon(float* result, const float* matrix, const float* vector, int m, int n, int stride) {
    std::memset(result, 0, n * sizeof(float));
    for (int i = 0; i < m; ++i) {
        float vec_val = vector[i];
        int j = 0;
        // Vectorized part: scale 4 elements of row i by vector[i]
        // and accumulate them into result[j..j+3].
        for (; j <= n - 4; j += 4) {
            float32x4_t mat_val = vld1q_f32(matrix + i * stride + j);
            float32x4_t mul_val = vmulq_n_f32(mat_val, vec_val);
            float32x4_t r = vld1q_f32(result + j);
            vst1q_f32(result + j, vaddq_f32(r, mul_val));
        }
        // Scalar tail for the remaining n % 4 elements.
        for (; j < n; ++j) {
            result[j] += matrix[i * stride + j] * vec_val;
        }
    }
}
This code computes GEMV column-wise: for each row i, it scales the row by vector[i] and accumulates it into result, so that result[j] = sum over i of matrix[i][j] * vector[i], i.e., each output element is the dot product of the input vector with one column of the matrix. Both the matrix and the vector are stored contiguously in row-major order, and result holds the output vector. matrix is the base address of the input matrix, vector is the base address of the input vector, m is the number of rows, n is the number of columns, and stride is the element offset (in floats, not bytes) between the start of one row and the start of the next.

I believe the implementation is correct, but the results are wrong, and I'm not sure what the issue is. Could anyone provide a hint? Also, is there a more optimal way to compute GEMV column-wise than this approach?
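In case it helps, here is a minimal scalar sketch of the semantics I intend the NEON version to have (the function name is just for illustration, not part of my actual code). It is only meant to pin down the expected output, not to serve as a performance baseline:

// Scalar reference for the intended computation: result has n elements,
// vector has m elements, and result[j] accumulates column j of the
// matrix dotted with the vector.
void gemv_columnwise_stride_scalar(float* result, const float* matrix, const float* vector, int m, int n, int stride) {
    for (int j = 0; j < n; ++j) {
        result[j] = 0.0f;
    }
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            result[j] += matrix[i * stride + j] * vector[i];
        }
    }
}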