Above is a diagram of the A100 SM. It shows that the SM is partitioned into four sub-partitions (groups of cores), each of which executes 32-thread warps.
At any given time, how many warps are running on each sub-partition? One or multiple?
Suppose each thread in a warp must add two 64-bit floating-point numbers. Since the sub-partition has only 8 FP64 units, it can’t do 32 FP64 operations in one clock cycle. What happens in this case? Do the 32 FP64 operations execute in groups of 8 over 4 clock cycles? And while this happens, can the INT32 and FP32 units execute instructions from other warps (or from the same warp)?
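To make the arithmetic behind my guess concrete, here is a small Python sketch. The unit counts are what I read off the diagram for one sub-partition, and the "process the warp in lane groups as wide as the unit count" model is only my assumption, not something I have found documented. The names `UNITS_PER_SUBPARTITION` and `issue_cycles` are made up for illustration.

```python
import math

WARP_SIZE = 32  # threads per warp

# Execution units per SM sub-partition, as read from the A100 SM diagram.
UNITS_PER_SUBPARTITION = {"FP64": 8, "FP32": 16, "INT32": 16}


def issue_cycles(unit: str, warp_size: int = WARP_SIZE) -> int:
    """Cycles one warp-wide instruction would occupy `unit`.

    ASSUMPTION: the hardware handles the warp in groups of lanes as wide
    as the number of units, one group per clock cycle. Real hardware may
    pipeline or schedule this differently.
    """
    return math.ceil(warp_size / UNITS_PER_SUBPARTITION[unit])


if __name__ == "__main__":
    for unit in UNITS_PER_SUBPARTITION:
        print(f"{unit}: {issue_cycles(unit)} cycle(s) per warp instruction")
```

Under this model, a warp-wide FP64 add would occupy the FP64 units for 4 cycles, and an FP32 or INT32 instruction would occupy its units for 2 cycles. Whether the scheduler can issue to the other units during those cycles is exactly what I am unsure about.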
I would also appreciate any references (books, papers, articles, etc.) that would help me understand this better.
Thanks!