I’m using a self-hosted Llama 3.3 70B (GGUF Q5) model for my project on Runpod with an H100 GPU. So far it’s working great, but I’m wondering if it’s possible to set up a system that automatically scales resources with demand. Specifically, I’d like to increase throughput and handle more concurrent sessions during high-demand periods.
For example, with Kubernetes on DigitalOcean, you can configure autoscaling to add up to 10 nodes when there’s a spike in load. Is there anything similar for hosting LLMs? Has anyone managed to set up something like this with Runpod or a similar platform?
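To make the Kubernetes comparison concrete, here’s a minimal sketch of the kind of setup I mean, using the official Kubernetes Python client to attach a HorizontalPodAutoscaler to a hypothetical Deployment called `llm-server` (that name, the namespace, and the thresholds are all placeholders, not anything from my actual setup). This covers the pod-level scaling; on DigitalOcean, the node pool’s cluster autoscaler would then add nodes (up to its configured max, e.g. 10) when the extra pods can’t be scheduled:

```python
# Sketch only: assumes an existing cluster and a Deployment named "llm-server"
# that serves the model. All names and thresholds are placeholders.
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

hpa = client.V1HorizontalPodAutoscaler(
    api_version="autoscaling/v1",
    kind="HorizontalPodAutoscaler",
    metadata=client.V1ObjectMeta(name="llm-server-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1",
            kind="Deployment",
            name="llm-server",
        ),
        min_replicas=1,
        max_replicas=10,
        target_cpu_utilization_percentage=70,  # add replicas above 70% CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```

(I realize CPU utilization probably isn’t the right signal for GPU inference; presumably you’d want to scale on something like request queue depth or tokens/s instead, but the above shows the general shape of what I’m after.)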