I have a large matrix and I want to output all the indices where the elements in the matrix are less than 0. I have this MWE in numba:

    import numba as nb
    import numpy as np

    A = np.random.random(size=(1000, 1000)) - 0.1

    @nb.njit(cache=True)
    def numba_only(arr):
        # Preallocate for the worst case where every element is negative
        rows = np.empty(arr.shape[0] * arr.shape[1])
        cols = np.empty(arr.shape[0] * arr.shape[1])
        idx = 0  # running count of matches found so far
        for i in range(arr.shape[0]):
            for j in range(arr.shape[1]):
                if arr[i, j] < 0:
                    rows[idx] = i
                    cols[idx] = j
                    idx += 1
        # Trim the unused tail of both buffers
        return rows[:idx], cols[:idx]

Timing I get:

    %timeit numba_only(A)
    2.29 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This is a little faster than np.where(A < 0) (wrapped in a function, numpy_only), which gives:

    %timeit numpy_only(A)
    3.56 ms ± 59.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

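The definition of numpy_only is not shown above. A minimal sketch of what it presumably looks like, assuming it is just a thin wrapper around np.where:

```python
import numpy as np


def numpy_only(arr):
    # Assumed definition (not shown in the post): returns a tuple
    # (row_indices, col_indices) of the negative entries, in row-major order.
    return np.where(arr < 0)
```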
Can the numba code be sped up by parallelization somehow? I realise it might be memory bound but I think that modern hardware should allow some level of parallel access to memory.
I am not sure how to use nb.prange to achieve this, because the running index idx is shared across all loop iterations.