I have some familiarity with Halide and am starting to learn to use CUDA with it. To start, I ran the CUDA sample app that ships with the Halide source code, cuda_mat_mul: https://github.com/halide/Halide/tree/main/apps/cuda_mat_mul
I got some reasonable if unimpressive timings:
cuda_mat_mul:
CPU, autoschedule (Adams2019): 4.2ms
GPU, autoschedule (Anderson2021): 3.0ms
GPU, manual schedule: 1.2ms
cuBLAS: 0.42ms
Does this seem right? I have an NVIDIA GeForce RTX 3050 Ti laptop GPU and a Core i5-11400H CPU.
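For reference, here's roughly how I'm timing the mat_mul calls. This is a sketch modeled on the app's runner using the benchmark() helper from tools/halide_benchmark.h; the matrix size, fill values, and sample counts here are my own placeholders and may not match the actual runner:

#include <cstdio>
#include "HalideBuffer.h"
#include "halide_benchmark.h"
#include "mat_mul.h"  // AOT-generated header for the mat_mul pipeline

int main() {
    const int size = 1024;
    Halide::Runtime::Buffer<float> A(size, size), B(size, size), C(size, size);
    A.fill(1.0f);
    B.fill(2.0f);

    // Best average time per call over 10 samples of 10 iterations each.
    double t = Halide::Tools::benchmark(10, 10, [&]() {
        mat_mul(A, B, C);
        C.device_sync();  // wait for the GPU to finish before stopping the clock
    });
    printf("%.3f ms\n", t * 1e3);
    return 0;
}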
I then tried to get another sample app, camera_pipe, running on the GPU:
https://github.com/halide/Halide/tree/main/apps/camera_pipe
It comes with schedules for both CPU and GPU, but the CMake file only builds the CPU version. I modified it to do a CUDA build by setting
FEATURES cuda cuda_capability_50
and giving it CUDA_INCLUDE_DIRS and CUDA_LIBRARIES, just like in the cuda_mat_mul app.
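The relevant CMake change looks roughly like this (a sketch; the generator/target names are from memory and may not match the app's CMakeLists.txt exactly):

add_halide_library(camera_pipe FROM camera_pipe.generator
                   FEATURES cuda cuda_capability_50)
target_include_directories(process PRIVATE ${CUDA_INCLUDE_DIRS})
target_link_libraries(process PRIVATE ${CUDA_LIBRARIES})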
I also added
output.copy_to_host();
in process.cpp
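Concretely, the timing/saving part of my process.cpp now looks roughly like this (a sketch; the generated function's argument list is from memory and may not match the app exactly):

double best = Halide::Tools::benchmark(timing_iterations, 1, [&]() {
    camera_pipe(input, matrix_3200, matrix_7000, color_temp, gamma, contrast,
                sharpen, blackLevel, whiteLevel, output);
    output.device_sync();  // wait for the GPU before stopping the timer
});
output.copy_to_host();     // copy the result back to host memory before saving it
fprintf(stderr, "Halide:\t%g ms\n", best * 1e3);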
I recorded the following run times:
CUDA, manual schedule: 1270ms
CPU, autoschedule (Adams2019): 10.5ms
CPU, manual schedule: 9.9ms
So CUDA was way slower than the CPU.
This was with a single timing iteration. I then tried doing 100 iterations.
CUDA, manual schedule, 1st iteration: 1270ms
next 100 iterations: 5.7ms
I then tried 100 iterations on the CPU:
CPU, manual schedule, 1st iteration: 10.8ms
next 100 iterations: 5.8ms
CPU, autoschedule (Adams2019), 100 iterations: 7ms
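For the first-iteration vs. steady-state split, my timing code looks roughly like this (a sketch; it needs <chrono>, <cstdio>, and halide_benchmark.h, and reuses the same hedged camera_pipe call as above):

auto run_pipeline = [&]() {
    camera_pipe(input, matrix_3200, matrix_7000, color_temp, gamma, contrast,
                sharpen, blackLevel, whiteLevel, output);
    output.device_sync();  // make sure all device work has finished
};

auto t0 = std::chrono::steady_clock::now();
run_pipeline();  // 1st iteration
auto t1 = std::chrono::steady_clock::now();
printf("1st iteration: %.1f ms\n",
       std::chrono::duration<double, std::milli>(t1 - t0).count());

double avg = Halide::Tools::benchmark(1, 100, run_pipeline);  // next 100 iterations
printf("next 100 iterations: %.2f ms\n", avg * 1e3);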
Why is the first GPU iteration so slow? Why are subsequent runs almost the same speed on CPU and GPU?
I verified that it was generating the correct output image. I also tried calling input.set_host_dirty(), but it made no difference.
I then tried autoscheduling on the GPU using Anderson2021, but got the following error:
C:\Users\cordo\source\repos\camera_pipe17\out\build\x64-Debug\camera_pipe_auto_schedule.runtime.lib(camera_pipe_auto_schedule.runtime.obj) : error LNK2005: .weak._ZN6Halide7Runtime8Internal13custom_mallocE.default.halide_internal_aligned_alloc already defined in camera_pipe.runtime.lib(camera_pipe.runtime.obj)
There were several more similar-looking linker errors.
Thanks