According to PyTorch’s documentation on CUDA semantics, GPU operations are asynchronous and may run in parallel, given enough resources.
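As I understand that statement, a kernel launch should return control to Python before the GPU has finished the work, and the cost should only become visible at a synchronization point, e.g. (minimal illustration, assuming a CUDA device is available):

import time
import torch

a = torch.randn(4096, 4096, device="cuda")
_ = a @ a                        # warm-up so one-time init doesn't skew timing
torch.cuda.synchronize()

t0 = time.time()
b = a @ a                        # the launch returns before the matmul finishes
print(f"after launch: {time.time() - t0:.4f}s")
torch.cuda.synchronize()         # block until all queued GPU work is done
print(f"after sync:   {time.time() - t0:.4f}s")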
I have a function similarity(x: Tensor, y: Tensor) -> Tensor that performs a sequence of GPU operations using only PyTorch's API, with no synchronization in between (no calls to .cpu(), .item(), etc., and no I/O operations).
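For illustration only (this is not the actual implementation), think of it as something in the spirit of a cosine similarity where every operation stays on the GPU:

import torch
from torch import Tensor

def similarity(x: Tensor, y: Tensor) -> Tensor:
    # stand-in example: every op runs on the GPU, nothing here forces a sync
    return (x * y).sum() / (x.norm() * y.norm() + 1e-8)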
I call this function multiple times for independent computations of some similarity metric:
sim = [[similarity(m[:, i], m[:, j]) for j in range(m.shape[1])]
       for i in range(m.shape[1])]
Borrowing the method of comparison from this answer, I get ~2.5 seconds of runtime in both the unsynchronized and synchronized cases. Even if I wrap all operations within the function in a new CUDA stream for each call, I still get ~2.5 seconds. If I use a much smaller tensor m and adjust the function to use fewer resources (trading off accuracy), I get ~0.17 seconds of runtime in both cases.
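For completeness, this is roughly how I run and time the loop (a sketch of my setup, with m and similarity as above; timing is approximated here with wall-clock time around a final torch.cuda.synchronize(), and the per-call stream variant is the one mentioned above):

import time
import torch

def run(use_new_streams: bool) -> float:
    torch.cuda.synchronize()                  # start from an idle GPU
    t0 = time.time()
    sim = []
    for i in range(m.shape[1]):
        row = []
        for j in range(m.shape[1]):
            if use_new_streams:
                s = torch.cuda.Stream()       # fresh stream for this call
                with torch.cuda.stream(s):
                    row.append(similarity(m[:, i], m[:, j]))
            else:
                row.append(similarity(m[:, i], m[:, j]))  # default stream
        sim.append(row)
    torch.cuda.synchronize()                  # wait for all queued work to finish
    return time.time() - t0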
Given the equal runtime in both cases, and the fact that the GPU is barely utilized, it smells like there is no parallelization at all. What am I missing here?