Using Vertex AI custom container online predictions, I'm trying to deploy meta-llama/Meta-Llama-3-70B-Instruct with vLLM 0.4.1 on 8 NVIDIA L4 GPUs, and I'm getting:
/tmp/ray is over 95% full, available space: 5031063552; capacity:
101203873792. Object creation will fail if spilling is required.
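For readability, converting the byte counts from the warning (this is just arithmetic on the logged numbers):

```python
# Convert the byte counts from the Ray warning into GiB.
available = 5031063552
capacity = 101203873792
print(f"{available / 2**30:.1f} GiB free of {capacity / 2**30:.1f} GiB")
# → 4.7 GiB free of 94.3 GiB
```

So only about 5% of the ~94 GiB volume backing /tmp is still free when Ray emits the warning.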
This is the last log line I see before the deployment fails. Running the same custom container directly on a VM had no issues.
To create the model I'm using the Google aiplatform SDK:
model_resource = aiplatform.Model.upload(
    serving_container_image_uri=serving_container_image_uri,
    serving_container_shared_memory_size_mb=16384,
    ...
)
and to load the model with vLLM (code run by the container):
from vllm import LLM

self.model = LLM(
    model=model_config.model_hf_name,
    dtype="auto",
    tensor_parallel_size=model_config.tensor_parallel_size,
    enforce_eager=model_config.enforce_eager,
    disable_custom_all_reduce=model_config.disable_custom_all_reduce,
    worker_use_ray=bool(model_config.tensor_parallel_size > 1),
    enable_prefix_caching=False,
    max_model_len=model_config.max_seq_len,
)
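One direction I'm considering, as a minimal sketch (assumptions: Ray honors the RAY_TMPDIR environment variable at startup, and the container has a larger writable mount at the illustrative path below), is pointing Ray's temp/spill directory away from the small /tmp volume:

```python
import os

# Assumption: Ray reads RAY_TMPDIR when it starts up; the mount path below is
# illustrative and must exist on a volume with more free space than /tmp.
os.environ.setdefault("RAY_TMPDIR", "/mnt/disks/scratch/ray")

# With tensor_parallel_size > 1 and worker_use_ray=True, vLLM launches Ray
# internally, so this variable must be set before the LLM(...) call above.
```

If that doesn't apply on Vertex AI (where the container filesystem layout is managed), the alternative would be provisioning a larger boot disk for the deployed endpoint so /tmp has more headroom.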