I am a fourth-year university student working on a course project for Parallel Computing. I made a poor choice of algorithm for showcasing GPU performance against a sequential counterpart. I chose the Bat optimization algorithm, and the parallelization is as follows:
Thread 0: responsible for calculating average and applying stopping criteria by updating flag.
All other threads: updating each property of each bat in a generation/iteration of bat population.
First, here is the code:
__device__ void startAlgo(Bat *bats, int N, unsigned long long seed){
    __shared__ float global_best_fitness;
    __shared__ float average_best_position_of_batSwarm;
    __shared__ float global_best_position;
    curandState *state = new curandState; // Should ideally be per-thread and persistent, not recreated in a loop
    // Initialize random states once per thread
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        curand_init(seed, idx, 0, state);
    }
    if(threadIdx.x == 0){
        setGlobalandAverage(bats, N, global_best_fitness, global_best_position, average_best_position_of_batSwarm);
        while(!stopFlag){
            printf("Average personal best position: %f\n",average_best_position_of_batSwarm);
            float prev_avg = average_best_position_of_batSwarm;
            CalculateFitnessAverage(bats, N, average_best_position_of_batSwarm);
            ApplyStoppingCriteria(prev_avg, average_best_position_of_batSwarm);
            // syncThreads(); // Synchronize after potentially modifying stopFlag or other shared variables
        }
    } else {
        while(!stopFlag){ // Ensure this check is dynamic
            for(int i = threadIdx.x; i < N; i += blockDim.x * gridDim.x){ // Distribute work more evenly
                performWork(&bats[i], global_best_position, state);
                printf("id: %d, v: %f, p: %f, f: %f, l: %f, pr: %f, fit: %f, pbfit: %f, pbp: %f\n",i,bats[i].velocity,bats[i].position,bats[i].frequency,bats[i].loudness,bats[i].pulse_rate,bats[i].fitness,bats[i].personal_best_fitness,bats[i].personal_best_position);
            }
            __syncthreads(); // Sync all threads to recheck the stopping flag
        }
    }
    delete state; // Clean up the state
}
This function is called after the initialization.
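As the comment on the curandState line hints, the generator state does not have to be heap-allocated with new/delete on the device. Below is a minimal sketch of the usual alternative, where each thread keeps its state in a local variable and seeds it with its own subsequence. The names sampleKernel and samplesInRange are hypothetical and exist only for this illustration; they are not part of the project code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical demo kernel: each thread draws one uniform sample.
__global__ void sampleKernel(float *out, int n, unsigned long long seed)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;              // safe here: no barrier follows

    curandState state;                 // per-thread local state, no new/delete
    curand_init(seed, idx, 0, &state); // same seed, distinct subsequence per thread
    out[idx] = curand_uniform(&state); // uniform float in (0, 1]
}

// Hypothetical host helper: runs the kernel and checks every sample is in (0, 1].
bool samplesInRange(int n, unsigned long long seed)
{
    float *dOut = nullptr;
    if (cudaMalloc(&dOut, n * sizeof(float)) != cudaSuccess) return false;

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    sampleKernel<<<blocks, threads>>>(dOut, n, seed);

    std::vector<float> h(n, -1.0f);
    // cudaMemcpy waits for the kernel and reports any launch/runtime error.
    cudaError_t err = cudaMemcpy(h.data(), dOut, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }
    for (float v : h) {
        if (!(v > 0.0f && v <= 1.0f)) return false;
    }
    return true;
}
```

Calling samplesInRange(1000, 1234ULL) from main would exercise the kernel. If the state has to persist across several kernel launches, a common variant is an array of curandState in global memory with one entry per thread, which is not shown here.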
I have confirmed that every thread does enter the function, but only thread 0 does its work in a loop, and since the bat population is never updated, the program stops as a consequence. I have also confirmed that this has nothing to do with thread 0 starting first or anything like that. I even tried adding a spin-wait mechanism to check whether the rest of the threads do their work at least once, but no luck. I realize that there may be a lot of rookie mistakes, but since I am on Windows, I can't debug CUDA programs.
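One thing that stands out in the code above is that __syncthreads() is reached only by the threads in the else branch, while thread 0 loops without ever reaching it. CUDA requires every thread of a block to reach the same __syncthreads() call, so a barrier inside a divergent branch can hang or behave unpredictably. A minimal sketch of a layout that avoids this is shown below: all threads run the same loop, the barriers sit outside the if, and the stop flag lives in shared memory and is read only after a barrier. It assumes a single block, and toyUpdate, toySwarmKernel and runToySwarm are hypothetical stand-ins (the update just halves each position), not the Bat algorithm itself.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for performWork: move a position halfway toward 0.
__device__ void toyUpdate(float *pos) { *pos *= 0.5f; }

// Assumes a launch with ONE block: __syncthreads() only covers a single block.
__global__ void toySwarmKernel(float *positions, int n, float eps,
                               int maxIters, int *itersOut)
{
    __shared__ float avg;  // average from the previous iteration
    __shared__ int stop;   // written by thread 0, read by every thread

    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) sum += positions[i];
        avg = sum / n;
        stop = 0;
    }
    __syncthreads();

    int iter = 0;
    while (true) {
        // Phase 1: every thread (including thread 0) updates its share.
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            toyUpdate(&positions[i]);
        __syncthreads();   // all updates are visible before the average is taken

        // Phase 2: thread 0 alone evaluates the stopping criterion.
        if (threadIdx.x == 0) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i) sum += positions[i];
            float newAvg = sum / n;
            stop = (fabsf(newAvg - avg) < eps) || (iter + 1 >= maxIters);
            avg = newAvg;
        }
        __syncthreads();   // outside the if, so every thread reaches it

        ++iter;
        if (stop) break;   // same shared value for all threads: uniform exit
    }
    if (threadIdx.x == 0) *itersOut = iter;
}

// Hypothetical host helper: returns the iteration count, or -1 on a CUDA error.
int runToySwarm(int n, float eps)
{
    std::vector<float> h(n, 1.0f);  // every position starts at 1.0
    float *dPos = nullptr;
    int *dIters = nullptr;
    if (cudaMalloc(&dPos, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&dIters, sizeof(int)) != cudaSuccess) { cudaFree(dPos); return -1; }
    cudaMemcpy(dPos, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    toySwarmKernel<<<1, 128>>>(dPos, n, eps, 1000, dIters);

    int iters = -1;
    // cudaMemcpy waits for the kernel and reports any launch/runtime error.
    cudaError_t err = cudaMemcpy(&iters, dIters, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        iters = -1;
    }
    cudaFree(dPos);
    cudaFree(dIters);
    return iters;
}
```

Calling runToySwarm(256, 1e-3f) from main runs the loop until the average changes by less than eps. The sequential average on thread 0 is kept only to mirror the structure of the post; a block-wide reduction would be the usual next step, and spanning several blocks would need a grid-wide synchronization mechanism instead of __syncthreads().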