I use a pseudospectral DNS code for fluid simulations (a code I inherited) and I’m trying to boost performance by replacing the old FFT routines with equivalent FFTW routines. I have done this successfully in that I am getting the correct answers in my test cases, but I feel like I’m doing some things inefficiently, and I would appreciate tips on specific things that might improve my code. I will show a sample code snippet of the FFT routines and then ask specific questions based on what I’ve observed so far.
Sample Code:
! Complex array u of size (nyp,nz,nx/2) is stored in us and normalized before transform
! mx = (3/2)nx, mz = (3/2)nz for de-aliasing
...
complex(C_DOUBLE_COMPLEX),dimension(nyp,mz,mx) :: us,aspec
real(C_DOUBLE),dimension(nyp,mz,mx) :: up,aphys
...
! Plan FFTW transforms with dummy variables
planZb = fftw_plan_dft_1d(mz,aspec,aspec,FFTW_BACKWARD,FFTW_PATIENT)
planXb = fftw_plan_dft_c2r_1d(mx,aspec,aphys,FFTW_PATIENT)
planY = fftw_plan_r2r_1d(nyp,aphys,aphys,FFTW_REDFT00,FFTW_PATIENT)
...
! Complex --> Complex z-transform
do k = 1,nxh
    do i = 1,nyp
        call fftw_execute_dft(planZb,us(i,:,k),us(i,:,k))
        ...
    end do
end do
! Complex --> Real x-transform
do j = 1,mz
    do i = 1,nyp
        call fftw_execute_dft_c2r(planXb,us(i,j,:),up(i,j,:))
        ...
    end do
end do
! Real --> Real y-transform (DCT-I)
do k = 1,mx
    do j = 1,mz
        call fftw_execute_r2r(planY,up(:,j,k),up(:,j,k))
        ...
    end do
end do
! Do stuff here
! Inverse transforms here, reverse process above + normalizations
Notes/Questions:
- I use OpenMP threading and a few compiler optimizations, not shown here. These improve performance quite a bit, but I want to focus on how I’m using FFTW and arranging my data.
- In the full version of the code, I’m doing each transform on 58 different variables of identical size to u/us/up. The FFTW documentation recommends making a plan for each variable you transform, since subsequent plans are cheap to compute, but is that really useful for such a large number of variables? (Also, the inverse transforms are on different variables.)
- I have tried using fftw_plan_many_dft... for the transforms instead of the loops shown above. However, this requires me to shuffle the data so that the transform direction index (x, y, or z) is first, and simple tests I’ve done show this method to be much slower, especially as the grid size increases. Is there an efficient way to do this using FFTW’s rank 0 transforms?
- I do sequential 1D transforms in the x- and z-directions, but I can also do this as a single 2D transform. However, in a test code, I found that the 2D transform was comparable in compute time, even slightly slower. Is this expected, or does it warrant further investigation to get better performance from the 2D transform?
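To make the third question concrete: as far as I can tell from the FFTW documentation, the advanced interface also accepts stride and dist arguments, so the z-transform could in principle be planned on the array as stored, without reordering. The following is a minimal, untested sketch of that idea. The program name, the small grid sizes, the all-ones test data, and the use of FFTW_MEASURE are placeholders for illustration only, and it assumes the fftw3.f03 interface file is on the include path. It also loops over every k, whereas the real code only needs k = 1,nxh.

```fortran
program strided_z_sketch
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'

  ! Hypothetical small sizes, chosen only for this sketch
  integer(C_INT), parameter :: nyp = 9, mz = 12, mx = 12
  complex(C_DOUBLE_COMPLEX), allocatable :: us(:,:,:)
  type(C_PTR) :: planZb
  integer(C_INT) :: n(1)
  integer :: k

  allocate(us(nyp,mz,mx))
  n(1) = mz

  ! One plan covers a whole (nyp,mz) slab at fixed k:
  !   howmany = nyp transforms of length mz,
  !   stride  = nyp (consecutive z points are nyp elements apart),
  !   dist    = 1   (consecutive transforms start one element apart).
  ! Plan before filling the array, since FFTW_MEASURE overwrites it.
  planZb = fftw_plan_many_dft(1_C_INT, n, nyp,        &
                              us, n, nyp, 1_C_INT,    &
                              us, n, nyp, 1_C_INT,    &
                              FFTW_BACKWARD, FFTW_MEASURE)
  if (.not. c_associated(planZb)) stop 'planning failed'

  ! Test data: all ones, so each length-mz transform should give
  ! mz in the first z entry and zero in the others.
  us = (1.0_C_DOUBLE, 0.0_C_DOUBLE)

  ! us(:,:,k) is a contiguous slab, so no temporary copy should be needed.
  ! This assumes each slab has the same alignment as the planned array;
  ! adding FFTW_UNALIGNED to the plan flags is the conservative choice.
  do k = 1, mx
     call fftw_execute_dft(planZb, us(:,:,k), us(:,:,k))
  end do

  ! Both printed values should be near zero
  print *, maxval(abs(us(:,1,:) - mz))
  print *, maxval(abs(us(:,2:,:)))

  call fftw_destroy_plan(planZb)
  deallocate(us)
end program strided_z_sketch
```

Compared with the loop version, this passes contiguous slabs rather than strided sections such as us(i,:,k), for which the compiler may create temporary copies on each call. Whether the strided plan is actually faster for realistic grid sizes would need to be measured.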
Thanks!