I have an external CUDA kernel function which I call using a RawKernel. This kernel is defined in a first .cu file. From this kernel, I call some auxiliary __device__ functions which are defined in a separate .cuh file.
If I compile the main kernel by calling RawKernel, then modify an auxiliary __device__ function in the .cuh file and recompile, the output of the main kernel does not reflect the changes to the auxiliary __device__ function.
The main kernel is defined as:
extern "C" __global__ void lin_trans(cudaTextureObject_t texObj, float *output, int Nx, int Ny, float x_min, float x_max, float y_min, float y_max, float *x_coord, float *y_coord) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    // Bounds check first, so out-of-range threads never read x_coord / y_coord.
    if (x >= Nx || y >= Ny) return;
    // Computed but not used yet: the stub below is called with the raw indices.
    float x_trans = transformToTexelCoord(x_coord[y*Nx+x],x_min,x_max,Nx);
    float y_trans = transformToTexelCoord(y_coord[y*Nx+x],y_min,y_max,Ny);
    output[y*Nx+x] = mytex2Dcubic<float>(texObj,x,y);
}
Then mytex2Dcubic is defined as follows, in a separate .cuh file which I include at the top of the main .cu file:
template <class R>
__device__ R mytex2Dcubic(const cudaTextureObject_t tex, float x, float y) {
    R r = 500.0;
    return r;
}
Finally, the cupy code is defined as follows:
from pathlib import Path
import cupy as cp

d = Path(__file__).resolve().parent
source_path = str(d / "bicubic_inter/bicubicTexture_k.cu")
include_path = str(d / "bicubic_inter")
# Read the main kernel source; the .cuh file is pulled in by #include at compile time.
with open(source_path, "r") as file:
    cuda_source = file.read()
include = "-I" + include_path
# ....
# Additional code not relevant to this problem
# ....
my_kernel = cp.RawKernel(cuda_source, 'lin_trans', (include,), backend='nvcc')
threads_x = 16
threads_y = 16
# Round up so the grid covers the whole Nx x Ny output.
grid_x = (Nx + threads_x - 1) // threads_x
grid_y = (Ny + threads_y - 1) // threads_y
my_kernel((grid_x, grid_y), (threads_x, threads_y), (texobj, real_output, Nx, Ny, cp.float32(-10.0), cp.float32(10.0), cp.float32(-10.0), cp.float32(10.0), xx, yy))
In this example, the output is correct only the first time I compile the code (an array with all entries equal to 500.0). If I then change the value of r in mytex2Dcubic and recompile, the output does not change.
If, instead of using a separate file, I define mytex2Dcubic in the main CUDA file that is passed to RawKernel, the problem goes away.
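A possible explanation, stated here as an assumption rather than a confirmed diagnosis: CuPy caches compiled kernels, and if the cache lookup depends on the source string and compile options but not on the contents of #include'd files, then editing only the .cuh file would leave the lookup unchanged. The sketch below shows one way to test that idea. The helper source_with_header_digest is hypothetical (it is not part of CuPy); it appends a comment containing a hash of the header files to the .cu source, so that the string handed to RawKernel changes whenever a header does.

```python
import hashlib
from pathlib import Path


def source_with_header_digest(source_path, header_paths):
    """Return the .cu source followed by a comment holding a digest of the headers.

    Hypothetical helper, not a CuPy API. Assumption: the compiled-kernel cache
    is looked up from the source string, so changing this trailing comment is
    enough to force a fresh compilation.
    """
    digest = hashlib.sha256()
    # Sort the paths so the digest does not depend on iteration order.
    for header in sorted(str(p) for p in header_paths):
        digest.update(Path(header).read_bytes())
    source = Path(source_path).read_text()
    # A trailing C++ comment does not affect the compiled code.
    return source + "\n// header digest: " + digest.hexdigest() + "\n"
```

With the code above, cuda_source would be built as source_with_header_digest(source_path, Path(include_path).glob("*.cuh")) instead of being read directly from the .cu file. If the output then follows edits to the .cuh file, a stale cache entry is the likely cause. Pointing the CUPY_CACHE_DIR environment variable at an empty directory before running is another way to check whether the cache is involved.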
Ideally, I would like to keep the functions split across several files, since that is easier to read and navigate.
Any idea what could be happening here?