Forward pass of some TensorRT conv layers is blocked by cudaMemcpyAsync from another thread

See this nsys profile: