I successfully executed both fwd and inverse cufft and used extra kernels between them and after the latter to scale their values.
I managed to add streams to the previous stated example. The app performs as expected.
On a third version I successfully used callbacks to replace such extra-scaling kernels as well. The callbacks are used to load Complex and Store Real on the inverse fft.
Finally, I wonder if:
is it possible to combine callbacks with streams to avoid using those extra scale kernels in order to run on independent streams (not the default or NULL one)? I implemented it but I got the following error:
`cuFFT error 6:CUFFT_EXEC_FAILED at cuda-fft.cu
when calling this line code at the inverse fft, which has the callbacks :
CHECK_CUFFT_ERRORS( cufftExecC2R( bwd, (cufftComplex*)out_d, (cufftReal*)d[0] ) );
The original version, using scaling-kernels between them and after the latter to scale their values. Summarised following:
// creating both the fwd & inverse 3D plan
if ((cufftPlan3d(&fwd, dim[0], dim[1], dim[2], CUFFT_R2C))!= CUFFT_SUCCESS) {printf("cufft forward_plan errorn"); exit(-1);}
if ((cufftPlan3d(&bwd, dim[0], dim[1], dim[2], CUFFT_C2R))!= CUFFT_SUCCESS) {printf("cufft inverse_plan errorn"); exit(-1);}
...
// performing the ffts using such plans
if ((cufftExecR2C(fwd, (cufftReal*)d[1], (cufftComplex*)out_d))!=CUFFT_SUCCESS) {printf("cufft execR2C-3D forward errorn"); exit(-1);}
scaling-kernel_1<<<,>>>
if ((cufftExecC2R(bwd, (cufftComplex*)out_d, (cufftReal*)d[0]))!=CUFFT_SUCCESS) {printf("cufft execC2R-3D inverse errorn"); exit(-1);}
scaling-kernel_2<<<,>>>
2nd version using streams
GPUerrchk(cudaStreamCreateWithFlags (&stream_fwd, cudaStreamNonBlocking));
// creating both the fwd & inverse 3D plan
if ((cufftPlan3d(&fwd, dim[0], dim[1], dim[2], CUFFT_R2C))!= CUFFT_SUCCESS) {printf("cufft forward_plan errorn"); exit(-1);}
if ((cufftSetStream(fwd, stream_fwd))!= CUFFT_SUCCESS) {printf("cufft cufftSetStream_fwd errorn"); exit(-1);}
GPUerrchk(cudaStreamCreateWithFlags (&stream_bwd, cudaStreamNonBlocking));
if ((cufftPlan3d(&bwd, dim[0], dim[1], dim[2], CUFFT_C2R))!= CUFFT_SUCCESS) {printf("cufft inverse_plan errorn"); exit(-1);}
if ((cufftSetStream(bwd, stream_bwd))!= CUFFT_SUCCESS) {printf("cufft cufftSetStream_bwd errorn"); exit(-1);}
...
// performing the ffts using such plans
if ((cufftExecR2C(fwd, (cufftReal*)d[1], (cufftComplex*)out_d))!=CUFFT_SUCCESS) {printf("cufft execR2C-3D forward errorn"); exit(-1);}
scaling-kernel_1<<<,,,stream_fwd>>>
if ((cufftExecC2R(bwd, (cufftComplex*)out_d, (cufftReal*)d[0]))!=CUFFT_SUCCESS) {printf("cufft execC2R-3D inverse errorn"); exit(-1);}
scaling-kernel_2<<<,,,stream_bwd>>>
3rd version: I successfully used callbacks to replace such extra-scaling kernels as well. The callbacks are used to load Complex and Store Real on the inverse fft:
...
cufftCallbackLoadC h_LoadCCallbackPtr;;
checkCudaErrors( cudaMemcpyFromSymbol(&h_LoadCCallbackPtr, d_LoadCCallbackPtr, sizeof(h_LoadCCallbackPtr)) );
checkCudaErrors( cufftXtSetCallback(bwd, (void **)&h_LoadCCallbackPtr, CUFFT_CB_LD_COMPLEX, (void **)&d_params_bwdL) );
cufftCallbackStoreR h_storeRCallbackPtr;
checkCudaErrors( cudaMemcpyFromSymbol(&h_storeRCallbackPtr, d_storeRCallbackPtr, sizeof(h_storeRCallbackPtr)) );
checkCudaErrors( cufftXtSetCallback(bwd, (void **)&h_storeRCallbackPtr, CUFFT_CB_ST_REAL, (void **)&d_params_bwdS) );
So, finally, again, I wonder if:
is it possible to combine callbacks with streams to avoid using those extra scale kernels in order to run on independent streams (not the default or NULL one)?