Only 4 tokens per second! Why is running Llama 3.1 8B Instruct on an NVIDIA V100 so slow? The entire code for reproducing the problem is included.