I am trying to speed up a function that, given a complex-valued array arr with n entries, calculates the sum of m operations on that array using BLAS routines and finally replaces the values of arr with the result.
For simplicity, assume the operations are matrix-vector multiplications with different square matrices mats, with arr as the vector, each followed by a multiplication with a scalar from an array scalars. In reality they are combinations of the functions from this question: Apple’s dispatch vs OpenMP to parallelize a for loop on Apple MacBook Pro with M3Pro, but I don’t want to bother you with too much detail here; only if necessary.
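In other words, with the simplifying assumption above, the function should compute

arr <- sum_{i=0}^{m-1} scalars[i] * (mats[i] * arr)

where each mats[i] is an n x n matrix and the result overwrites arr.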
Introducing BLAS already increased my performance by 30-40% and I thought parallelizing the following code would be a good idea.
#include <complex.h>
#include <stdint.h>
#include <stdlib.h>
#include <Accelerate/Accelerate.h> /* cblas_* and the __LAPACK_* types */

void sumProducts(double complex** arr,
                 uint64_t n,
                 double complex** mats,
                 double complex* scalars,
                 uint64_t m)
{
    __LAPACK_int N = (__LAPACK_int) n;
    double complex* tmp = calloc(n, sizeof(double complex));
    double complex* tmpSum = calloc(n, sizeof(double complex));
    __LAPACK_double_complex beta = 0;
    for (uint64_t i = 0; i < m; ++i) {
        /* tmp = mats[i] * (*arr) */
        __LAPACK_double_complex alpha = 1;
        cblas_zgemv(CblasRowMajor,
                    CblasNoTrans,
                    N,
                    N,
                    &alpha,
                    mats[i],
                    N,
                    *arr,
                    1,
                    &beta,
                    tmp,
                    1);
        /* tmpSum += scalars[i] * tmp */
        alpha = (__LAPACK_double_complex) scalars[i];
        cblas_zaxpy(N, &alpha, tmp, 1, tmpSum, 1);
    }
    free(tmp);
    *arr = tmpSum; /* hand the result back; the previous *arr buffer is not freed here */
}
My best guess on how to use OpenMP for this was to parallelize the for loop, let each product be computed by a single thread (see Apple’s dispatch vs OpenMP to parallelize a for loop on Apple MacBook Pro with M3Pro), and then let each thread add its result to the sum array:
void sumProductsOmp(double complex** arr,
                    uint64_t n,
                    double complex** mats,
                    double complex* scalars,
                    uint64_t m)
{
    __LAPACK_int N = (__LAPACK_int) n;
    double complex* tmp = NULL; /* each thread gets its own private copy below */
    double complex* tmpSum = calloc(n, sizeof(double complex));
    __LAPACK_double_complex beta = 0;
    #pragma omp parallel default(none) private(tmp) shared(arr, beta, m, mats, n, N, scalars, tmpSum)
    {
        #pragma omp for
        for (uint64_t i = 0; i < m; ++i) {
            tmp = calloc(n, sizeof(double complex));
            /* tmp = mats[i] * (*arr) */
            __LAPACK_double_complex alpha = 1;
            cblas_zgemv(CblasRowMajor,
                        CblasNoTrans,
                        N,
                        N,
                        &alpha,
                        mats[i],
                        N,
                        *arr,
                        1,
                        &beta,
                        tmp,
                        1);
            alpha = (__LAPACK_double_complex) scalars[i];
            /* serialize the accumulation: tmpSum += scalars[i] * tmp */
            #pragma omp critical(sum)
            {
                cblas_zaxpy(N, &alpha, tmp, 1, tmpSum, 1);
            }
            free(tmp);
        }
    }
    /* each thread already freed its private tmp; the outer copy was never allocated */
    *arr = tmpSum;
}
However, the outcome of my performance test was, in my opinion, rather decent for my MacBook Pro with an M3 Pro chip. (Results are in seconds.)
Here, n = 2^p and the set of matrices is the set of Pauli products on p sites, with m = 4^p.
  n      m     serial [s]       omp [s]
 16    256    0.000079541   0.000554125
 32   1024    0.001001458   0.000519291
 64   4096    0.013269750   0.004418000
128  16384    0.194747125   0.067422792
256  65536   11.373802375   8.255758541
Many other questions discuss using atomic or some other directive instead of critical for performance reasons. I don’t see that those answers are applicable here, since most of them only deal with updates to individual entries of a single matrix, not an accumulation into a whole array.
Do you see a more elegant way to delegate this task?
Moreover, for p = 9 my process was interrupted with signal 9: SIGKILL. I guess this is due to the large number of matrices (262144), each with the same number of entries.
Does this signal mean that I consumed all my memory?
Do you see a more elegant way to delegate this task?
Not from the BLAS-based starting point presented in the question, no. It’s conceivable that there is an overall better approach to the original problem, but parallelizing linear algebra computations does tend to run into issues with data dependencies. In this case, that manifests as the need to serialize calls to cblas_zaxpy().
Perhaps, though, you could realize an improvement by pre-allocating all the temp space needed for the selected number of threads, once, instead of having each thread allocate and deallocate.
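A minimal sketch of that idea, assuming the same Accelerate/OpenMP setup as in the question; the function name sumProductsOmpPrealloc and the use of omp_get_max_threads() to size a single scratch block are my own choices, not anything taken from your code:

#include <complex.h>
#include <stdint.h>
#include <stdlib.h>
#include <omp.h>
#include <Accelerate/Accelerate.h>

void sumProductsOmpPrealloc(double complex** arr,
                            uint64_t n,
                            double complex** mats,
                            double complex* scalars,
                            uint64_t m)
{
    __LAPACK_int N = (__LAPACK_int) n;
    int nthreads = omp_get_max_threads();
    /* one scratch vector per thread, allocated once up front */
    double complex* scratch = calloc((size_t) n * (size_t) nthreads, sizeof(double complex));
    double complex* tmpSum = calloc(n, sizeof(double complex));
    __LAPACK_double_complex beta = 0;
    #pragma omp parallel default(none) shared(arr, beta, m, mats, n, N, scalars, scratch, tmpSum)
    {
        /* each thread reuses its own slice of the scratch block in every iteration */
        double complex* tmp = scratch + (size_t) omp_get_thread_num() * n;
        #pragma omp for
        for (uint64_t i = 0; i < m; ++i) {
            __LAPACK_double_complex alpha = 1;
            cblas_zgemv(CblasRowMajor, CblasNoTrans, N, N, &alpha,
                        mats[i], N, *arr, 1, &beta, tmp, 1);
            alpha = (__LAPACK_double_complex) scalars[i];
            #pragma omp critical(sum)
            {
                cblas_zaxpy(N, &alpha, tmp, 1, tmpSum, 1);
            }
        }
    }
    free(scratch);
    *arr = tmpSum;
}

Because cblas_zgemv() is called with beta = 0, it overwrites each thread's slice on every iteration, so the scratch block never needs to be re-zeroed; the critical section around cblas_zaxpy() is still needed, for the reason given above.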
Additionally, if you were planning multiple calls to sumProducts(), then it’s possible that performing those calls in parallel could provide better performance than parallelizing the implementation of sumProducts() itself does.
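For example, if you had a whole batch of independent vectors to transform with the same mats and scalars, then something along these lines would let each call of the serial version run on its own thread, with no critical section at all. The batch array arrs, its length K, and the function name sumProductsBatch are hypothetical, purely for illustration:

#include <complex.h>
#include <stdint.h>

/* the serial version from the question */
void sumProducts(double complex** arr, uint64_t n,
                 double complex** mats, double complex* scalars, uint64_t m);

void sumProductsBatch(double complex** arrs, /* K independent vectors */
                      uint64_t K,
                      uint64_t n,
                      double complex** mats,
                      double complex* scalars,
                      uint64_t m)
{
    /* the calls do not share any writable data, so no synchronization is needed */
    #pragma omp parallel for default(none) shared(arrs, K, n, mats, scalars, m)
    for (uint64_t k = 0; k < K; ++k) {
        sumProducts(&arrs[k], n, mats, scalars, m);
    }
}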
I experienced process interruptions for p = 9 with signal 9: SIGKILL. I guess this is due to the large number of matrices (262144), each with the same number of entries. Does this signal mean that I consumed all my memory?
Not directly. If you had exhausted available memory, then you would expect first that one of your calloc() calls would fail (which you should be testing for; see the sketch below), and that as a result one or more subsequent operations would attempt a null-pointer dereference. That would normally cause a segmentation fault (SIGSEGV). Possibly your OpenMP implementation is handling such a segmentation fault by raising a SIGKILL, but it may also be that the SIGKILL was dispatched to the program externally, for reasons that are not immediately evident.
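For completeness, a minimal sketch of the kind of allocation check meant above; the helper name checked_calloc_z and the choice to exit on failure are hypothetical, not part of your code:

#include <complex.h>
#include <stdio.h>
#include <stdlib.h>

/* hypothetical helper: calloc for complex vectors that reports failure
   instead of silently returning NULL */
static double complex* checked_calloc_z(size_t n)
{
    double complex* p = calloc(n, sizeof(double complex));
    if (p == NULL) {
        fprintf(stderr, "calloc of %zu complex doubles failed\n", n);
        exit(EXIT_FAILURE); /* or propagate an error code to the caller instead */
    }
    return p;
}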