I have developed quite a simple neural network in C. I then used OpenMP to run multiple instances of this network at the same time to learn batches of training data, and the speedup was pretty decent. I guess I'm really just interested and find it fun, so I tried porting this network to run on a GPU using CUDA. The way I've decided to distribute the workload is... well, I'm not sure if it's good or not, because it's pretty fast, yet the reductions across multiple blocks account for 90% of the runtime. I'll describe what I have done and would like to hear if I'm way off in my approach:
1. The network is designed to be scalable, so any number of inputs, any number of neurons in the hidden layers, etc.
2. Let's say the input is 784 floats (MNIST) and the first hidden layer has 1024 neurons. I've decided to assign a maximum of 8 input floats to each thread. The 784 inputs are padded to 1024 to simplify things, as you will see later.
3. To compute the 1st neuron in the hidden layer I therefore need 1024/8 = 128 threads. To compute the full hidden layer of 1024 neurons I need 128 * 1024 = 131072 threads. This may seem like a lot, and it might just be, but I have based it on the following reasons (the sketch right after this list shows how these numbers map onto the constants the kernel uses):
3a. Keep as many SMs as possible busy for large tasks.
3b. Each SM of the GPU will run 16 blocks.
3c. Each block will contain 64 threads and use 4096 bytes of shared memory. This gives me close to maximum occupancy.
4. Each block loads 512 floats of the input layer and 512 weight values from global memory in a coalesced fashion; the 512 limit comes from the shared memory budget.
5. Because the (padded) input is more than 512 floats, I use two blocks to load the entire input.
6. Matrix multiplication, avoiding shared-memory bank conflicts.
7. Each block performs a reduction over its 64 threads.
8. Each block then writes its reduced value to global memory, where two blocks work in pairs to add their results to give the first neuron's value. This continues until all blocks have written their values to global memory.
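To make those numbers concrete, this is roughly how they map onto the constants used in the kernel further down. It is a paraphrase of my defines rather than a verbatim copy, with the values for the padded MNIST case:

#define SETSIZE      1024                            // one padded input sample (784 floats padded up to 1024)
#define N_H1         1024                            // neurons in hidden layer 1
#define THR_PBLK     64                              // threads per block
#define FP32_PTHR    8                               // input floats handled by each thread
#define FP32_PBLK    (THR_PBLK * FP32_PTHR)          // 512 floats of the input covered by one block
#define BLK_PNU_PAD  (SETSIZE / FP32_PBLK)           // 2 blocks co-operate on each hidden neuron
#define SHR_MEM_PBLK (2 * FP32_PBLK * sizeof(float)) // 4096 bytes: 512 inputs + 512 weights per block
// grid size: N_H1 * BLK_PNU_PAD = 2048 blocks of 64 threads = 131072 threads in total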
Everything works really well up to point 7. Compared to the CPU network it's a speedup of around 60x when using large layers. However, when I get to step 8 the kernel falls apart and becomes way slower than the CPU code. The difficulty is that I am not sure how to reduce two blocks together, as they do not share any memory besides global.
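Just to pin down what step 8 has to produce (this is the shape of the result I'm after, not what my kernel currently does): every pair of blocks owns one hidden neuron, so if each block wrote its 64-thread partial sum into a scratch buffer indexed by blockIdx.x, neuron n would simply be the sum of entries 2n and 2n+1. A second pass along those lines would look something like this, where partials is a hypothetical one-float-per-block scratch array:

// hypothetical second pass: combine the two per-block partial sums that belong to each hidden neuron
__global__ void sumBlockPairs(const float* partials, float* GPUh1, int nNeurons)
{
    unsigned int n = blockIdx.x * blockDim.x + threadIdx.x; // one thread per hidden neuron
    if (n < nNeurons) {
        GPUh1[n] = partials[2 * n] + partials[2 * n + 1];   // blocks 2n and 2n+1 each computed half of the dot product
    }
}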
So firstly, I am not sure if the granularity of the network is way too fine, i.e. each thread has way too little work to do, or if it's a somewhat OK approach. I have searched around for what others have implemented and can't find much info on this.
Below is the kernel, if it helps clarify the implementation:
__global__ void ffH1v3(int* GPUbatchCycle, float* GPUtrn, float* GPUh1, float* GPUs1, float* GPUw1) {
unsigned int globalID = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tid = threadIdx.x;
unsigned int i;
unsigned int j;
unsigned int stride;
int reduceIndex;
unsigned int batchCycle = 0;// *GPUbatchCycle;
unsigned int trnIndex;
unsigned int remainder;
extern __shared__ float sharedMem[];
float* inputs = sharedMem;
float* weights = sharedMem + (SHR_MEM_PBLK / sizeof(float) / 2);
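// map this thread's global ID to an offset within the current training sample;
// threads past the end of the sample wrap around, so every block pair re-reads the same input vector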
trnIndex = ((batchCycle * SETSIZE) + (globalID * FP32_PTHR) + 0);
if (trnIndex >= ((batchCycle * SETSIZE) + (SETSIZE))){
remainder = (trnIndex/SETSIZE);
trnIndex = trnIndex - (((remainder- batchCycle) * SETSIZE));
}
//load inputs
for (i = 0; i < FP32_PTHR; i++) {
inputs[(i * THR_PBLK) + tid] = GPUtrn[trnIndex + i];
}
//load weights
for (i = 0; i < FP32_PTHR; i++) {
weights[(i * THR_PBLK) + tid] = GPUw1[((batchCycle * THR_PBLK * FP32_PTHR) + (globalID * FP32_PTHR) + i)];
}
//matrix multiplication: each thread forms its partial dot product from the FP32_PTHR (input, weight) pairs it loaded
inputs[tid] = (inputs[tid] * weights[tid]);
for (i = 1; i < FP32_PTHR; i++) {
inputs[tid] += (inputs[(i * THR_PBLK) + tid] * weights[(i * THR_PBLK) + tid]);
}
//individual block reduction
__syncthreads(); // make sure every thread's partial dot product is visible before the tree reduction
for (int stride = 1; stride < (THR_PBLK); stride *= 2) {
reduceIndex = 2 * stride * tid;
if (reduceIndex < (THR_PBLK)) {
inputs[reduceIndex] += inputs[reduceIndex + stride];
}
__syncthreads();
}
//The dreaded step no. 8 where I have to make blocks co-operate
if (tid == 0) {
for (i = 0; i < ((SETSIZE * N_H1) / FP32_PBLK); i++) {
if ((blockIdx.x >= (i* BLK_PNU_PAD)) && (blockIdx.x < ((i* BLK_PNU_PAD) + BLK_PNU_PAD))) {
if (blockIdx.x == i) {
atomicExch(&GPUh1[i], inputs[0]);
//printf("IF GPUh1[,%d,] = ,%.9f,n", i, GPUh1[i]);
}
else if (blockIdx.x > i) {
//printf("block %d is heren", blockIdx.x);
for (j = 1; j < BLK_PNU_PAD; j++) {
atomicAdd(&GPUh1[i], inputs[0]);
//printf("eLSE GPUh1[,%d,] = ,%.9f,n", i, GPUh1[i]);
}
}
}
}
}
__syncthreads();
}
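For completeness, the host-side launch is along these lines (paraphrased, assuming the defines from the sketch above and device buffers already allocated with cudaMalloc elsewhere):

// 2048 blocks of 64 threads, each with 4096 bytes of dynamic shared memory
unsigned int numBlocks = N_H1 * BLK_PNU_PAD;
ffH1v3<<<numBlocks, THR_PBLK, SHR_MEM_PBLK>>>(GPUbatchCycle, GPUtrn, GPUh1, GPUs1, GPUw1);
cudaDeviceSynchronize(); // so the timing covers the whole kernel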
The code I have written is scalable in every way. I can assign 512 threads per block, for example, or reduce the memory each block uses, and it seamlessly (or at least I think so) adjusts everything as needed to redistribute the workload. For example, if I specify that I want to use 128 threads per block, the amount of shared memory per block and per thread shrinks proportionally, each thread deals with fewer inputs, and so on.

I have used the Nsight profiling tool and I get quite good results from what I can tell. When I started, it reported that I could get a 95% speedup due to uncoalesced global accesses, that I was using too many registers per thread, and that there were multiple bank conflicts. I fixed all of that and ended up with 95% occupancy, compute throughput of 68%, and an L1 hit rate in the 90% range. I'm not saying it was great, as I do not know that much about this, but I think it was reasonable in terms of timing and performance until I included step 8, reducing blocks through global memory. So do you think I need to go back to the drawing board and completely change my approach?