I have a fairly simple function that runs quite fast in parallel with Numba, and I would like to know whether I can run it on my Apple M3 Max GPU. However, I have never written GPU code before (I have always worked on Macs), so I am a little lost.
Here is a small use case:
<code>import numpy as np
import numba as nb
N = 15 # size of a and b
K = 127 # size of the result
L = 943 # size of the operator
a = np.random.standard_normal(size=N)
b = np.random.standard_normal(size=N)
operator = np.zeros(shape=(L, 4), dtype=np.int64)
operator[:, 0] = np.random.randint(size=L, low=0, high=N)
operator[:, 1] = np.random.randint(size=L, low=0, high=N)
operator[:, 2] = np.random.randint(size=L, low=0, high=K)
operator[:, 3] = np.random.randint(size=L, low=1, high=10)
@nb.njit(parallel=True)
def shuffle_mul(a: np.ndarray, b: np.ndarray, operator: np.ndarray) -> np.ndarray:
res = np.zeros(shape=K, dtype=a.dtype)
for n in nb.prange(len(operator)):
i, j, k, count = operator[n]
res[k] += count * a[i] * b[j]
return res
shuffle_mul(a, b, operator=operator) # warm-up
%timeit shuffle_mul(a, b, operator=operator) # 173 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
</code>
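For reference, the same computation can also be written as a vectorized scatter-add with NumPy's <code>np.add.at</code>, which is roughly the form that GPU libraries tend to expect (a gather followed by a segmented/indexed sum). This is just a sketch of the equivalence, not a GPU port:

```python
import numpy as np

N, K, L = 15, 127, 943
rng = np.random.default_rng(0)
a = rng.standard_normal(N)
b = rng.standard_normal(N)
operator = np.column_stack([
    rng.integers(0, N, L),   # i: indices into a
    rng.integers(0, N, L),   # j: indices into b
    rng.integers(0, K, L),   # k: indices into res
    rng.integers(1, 10, L),  # count: multiplicity
])

def shuffle_mul_np(a, b, operator):
    i, j, k, count = operator.T
    res = np.zeros(K, dtype=a.dtype)
    # np.add.at is an unbuffered scatter-add, so repeated k values
    # accumulate correctly (plain res[k] += ... would drop duplicates)
    np.add.at(res, k, count * a[i] * b[j])
    return res

# check against a straightforward loop
ref = np.zeros(K)
for i, j, k, c in operator:
    ref[k] += c * a[i] * b[j]
assert np.allclose(shuffle_mul_np(a, b, operator), ref)
```
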
Question 1: Is it even theoretically possible/interesting to run such a function on a GPU? Is it a problem that multiple threads might read <code>a</code> and <code>b</code>, or add to <code>res</code>, at the same time?
Question 2: What would be the best approach, and which library should I use: <code>metalcompute</code>, <code>jax</code>, PyTorch, …?
Thanks a lot!