Is the approach of my CUDA neural network way off?
I have developed a fairly simple neural network in C. I then used OpenMP to run multiple instances of this network at the same time to learn batches of training data, and the speedup was pretty decent. I'm really just interested and find it fun, so I tried porting this network to run on a GPU using CUDA. The way I've decided to distribute the workload is... well, I'm not sure if it's good or not, because it's pretty fast, but a reduction across multiple blocks accounts for 90% of the code's runtime. I'll describe what I have done and would like to hear if I'm way off in my approach: