I am quite new to CUDA in Julia, but I have been able to speed up my code using only threads (a single block). However, depending on the complexity of the kernel, the number of threads I can launch is limited to some value below 1024. I do a summation over a set of N points, and if N is greater than 340 a single block of threads is no longer enough. I have therefore looked at using multiple blocks, and the code now runs for larger N. However, this introduces synchronisation issues, since the only barrier I have is sync_threads(), which only synchronises the threads within one block. Is there any way to synchronise across blocks as well as threads? I sketch the code below:
function tot(x, y, N)
    ii = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if ii > N
        return
    end
    # * some detailed code *
    sync_threads()
    return nothing
end
CUDA.@sync begin
    threads_per_blocks = 256
    blocks_per_grid = ceil(Int, N / threads_per_blocks)
    # note: sizeof(N) is the byte size of N's type (8 for an Int64), not the value of N
    @cuda threads=threads_per_blocks blocks=blocks_per_grid shmem=sizeof(N)^2 fastmath=true tot(x, y, N)
end
synchronize()
How can I adapt this so that the blocks are synchronised, such that the overall result is the same as it would be if I could launch N threads in a single block?
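For reference, here is a minimal sketch of one common way to get a block-independent sum without any barrier across blocks. It is not the actual kernel above. It assumes the "detailed code" can be reduced to a plain sum of a Float32 vector, and the names block_sum! and out are hypothetical. Each block reduces its own slice in shared memory using sync_threads(), and then one thread per block adds that partial sum to a global accumulator with an atomic add. The sketch also assumes the block size is a power of two.

```julia
using CUDA

# Hypothetical kernel: each block sums its slice of x in shared memory,
# then one thread per block adds the block's partial sum to out[1].
function block_sum!(out, x, N)
    tid = threadIdx().x
    ii = tid + (blockIdx().x - 1) * blockDim().x
    shared = CuDynamicSharedArray(Float32, blockDim().x)

    # Out-of-range threads contribute zero instead of returning early,
    # so every thread in the block reaches each sync_threads() call.
    shared[tid] = ii <= N ? x[ii] : 0f0
    sync_threads()

    # Tree reduction within the block (assumes blockDim().x is a power of two).
    stride = blockDim().x ÷ 2
    while stride >= 1
        if tid <= stride
            shared[tid] += shared[tid + stride]
        end
        sync_threads()
        stride ÷= 2
    end

    # Combine the per-block results with an atomic add, so no barrier
    # between blocks is required.
    if tid == 1
        CUDA.@atomic out[1] += shared[1]
    end
    return nothing
end

N = 1000
x = CUDA.ones(Float32, N)      # stand-in data for the N points
out = CUDA.zeros(Float32, 1)   # global accumulator, must start at zero
threads = 256
blocks = cld(N, threads)
@cuda threads=threads blocks=blocks shmem=threads*sizeof(Float32) block_sum!(out, x, N)
total = Array(out)[1]          # copying to the host waits for the kernel to finish
```

If a later stage of the computation needs the complete sum, another option is to split the work into two kernel launches, since the second kernel only starts after the first has finished on the same stream. Note that with atomic adds the order of the floating-point additions can vary between runs, so the result may differ in the last bits from a single-block sum.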