I want to understand how, when a GPU executes a neural network, the operations are mapped onto the GPU's hardware resources. I am familiar with GPU architecture (especially NVIDIA's) and have a general picture of how an NN is executed on one, but I don't know how to get at the detailed, fine-grained scheduling of operations onto hardware resources, or how the cores actually execute them. Is there a tool, or a set of tools, for that?
To be more specific, let's imagine I have a pre-trained neural network in PyTorch and want to run it on an NVIDIA RTX 3090. How can I obtain the detailed scheduling of the operations (at the level of MAC operations, or of the NN's neurons/channels/layers) onto the corresponding hardware resources, i.e. SMs or threads?
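To make the setup concrete, here is a minimal sketch of the kind of experiment I have in mind (the choice of ResNet-18 and the profiler options are just placeholders): profiling an inference pass with `torch.profiler` shows which CUDA kernels each layer dispatches to and their timings, but as far as I can tell it does not show how those kernels' blocks/warps are scheduled onto individual SMs, which is the level of detail I am after.

```python
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

# Placeholder model standing in for my pre-trained network
model = models.resnet18(weights=None).cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

# Capture the CUDA kernels launched by each framework-level op
with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True) as prof:
        model(x)

# Kernel-level breakdown per op, but no mapping of blocks/warps to SMs
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```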