If functions in PyTorch are already fairly well optimized for parallel execution on a GPU, what is the benefit of learning CUDA if I'm not planning to use it for pretraining LLMs, but would only use it to optimize training for smaller deep neural nets? Are the net gains in training time worth the pains of learning it?
I'm currently going through *Programming Massively Parallel Processors* by Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj to learn; any other recommendations are greatly appreciated.