I’m running an ONNX model exported from TorchDynamo, and noticed that it runs significantly slower with CUDAExecutionProvider than with CPUExecutionProvider, with the CPU being ~3x as fast.
This is running on onnxruntime-gpu 1.17.1 and CUDA 12.
When I enable verbose logging, I see the following lines:
2024-04-10 16:16:48.390701377 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: Rank node name: _inlfunc__aten_convolution_onnx_token_77_n0
2024-04-10 16:16:48.390746629 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: CastLike node name: _inlfunc_aten_constant_pad_nd_token_79_n19
2024-04-10 16:16:48.390752511 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: Rank node name: _inlfunc_aten_constant_pad_nd_token_79_n1
2024-04-10 16:16:48.390756692 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: CastLike node name: _inlfunc_aten_constant_pad_nd_token_79_n3
2024-04-10 16:16:48.390851044 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: Pad node name: _inlfunc_aten_constant_pad_nd_token_79_n20
2024-04-10 16:16:48.390855420 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: Rank node name: _inlfunc__aten_convolution_onnx_token_81_n0
2024-04-10 16:16:48.390957342 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: aten_leaky_relu node name: _inlfunc_torch_nn_modules_activation_LeakyReLU_getattr_L__self____generator_activations___0___29_aten_leaky_relu_0
2024-04-10 16:16:48.390963282 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: Rank node name: _inlfunc__aten_convolution_onnx_token_84_n0
This seems to indicate that those nodes are dispatched to the CPU instead, thus slowing down overall execution.
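To get a sense of how many nodes are affected per op type, the log lines can be tallied. This is only a minimal sketch, assuming the verbose log has been captured into a string and that the message format matches the lines above; `count_unassigned_ops` and `SAMPLE` are hypothetical names used for illustration, not part of onnxruntime.

```python
import re
from collections import Counter

# Matches the "kernel not found" message format seen in the log above.
# Assumption: op type and node name contain no whitespace.
PATTERN = re.compile(
    r"CUDA kernel not found in registries for Op type: (\S+) node name: (\S+)"
)


def count_unassigned_ops(log_text):
    """Return a Counter mapping op type -> number of nodes without a CUDA kernel."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the log above, used as sample input.
SAMPLE = """\
2024-04-10 16:16:48.390701377 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: Rank node name: _inlfunc__aten_convolution_onnx_token_77_n0
2024-04-10 16:16:48.390746629 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: CastLike node name: _inlfunc_aten_constant_pad_nd_token_79_n19
2024-04-10 16:16:48.390752511 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: Rank node name: _inlfunc_aten_constant_pad_nd_token_79_n1
"""

# Print op types ordered by how many nodes lack a CUDA kernel.
for op_type, n in count_unassigned_ops(SAMPLE).most_common():
    print(op_type, n)
```

Running this over the full log should show whether the missing kernels are mostly small shape helpers (Rank, CastLike) or heavier ops such as Pad and aten_leaky_relu.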
Any idea what could have caused this?