I'm trying to run a multi-GPU local inference server with Ray Serve and vLLM. My full code and launch commands are in this GitHub repository. In short, I want to serve a large model that needs 2 GPUs, but the deployment only ever uses 1 of them. I have verified that my CUDA environment is healthy and that both GPUs are visible to PyTorch. Thanks in advance for any help.
@serve.deployment(ray_actor_options={"num_gpus": 2})
class VLLMPredictDeployment:
    ...
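
For reference, here is a minimal sketch of the shape of the deployment. It assumes vLLM's synchronous LLM API, tensor_parallel_size=2, and a placeholder model id; the actual code (and real model name) is in the repository.

import torch
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(ray_actor_options={"num_gpus": 2})
class VLLMPredictDeployment:
    def __init__(self, model_name: str):
        # Sanity check: how many GPUs does Ray actually expose to this actor?
        print("GPUs visible inside the actor:", torch.cuda.device_count())
        # tensor_parallel_size=2 asks vLLM to shard the model across both GPUs.
        self.llm = LLM(model=model_name, tensor_parallel_size=2)

    def __call__(self, prompt: str) -> str:
        # Generate a single completion for the given prompt.
        outputs = self.llm.generate([prompt], SamplingParams(max_tokens=128))
        return outputs[0].outputs[0].text


app = VLLMPredictDeployment.bind("your-model-id")  # placeholder model id
# serve.run(app) then starts the deployment locally.

The intent of pairing num_gpus=2 with tensor_parallel_size=2 is that the single Serve replica owns both GPUs and vLLM shards the model across them, but in practice only one GPU ends up being used.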