I want to perform offline batch inference with a model that is too large to fit on a single GPU, and I want to use tensor parallelism for this. Previously I have used vLLM for batch inference. However, I now have a custom model whose architecture is not among those vLLM supports.
My whole stack is built on top of ray, so I would like to distribute the tensor shards across ray workers and run inference there. So far it seems that with the plain map_batches API, each worker loads a full replica of the model, which will OOM for my model. This is, for example, what is done in this tutorial:
https://medium.com/kocdigital/scalable-batch-inference-on-large-language-models-using-ray-ac3bf1cf1384
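
To make it concrete, here is a minimal sketch of the pattern I mean (model name, sizes, and concurrency are placeholders, not my real setup). Every actor in the pool runs `__init__` and loads the whole model onto its own GPU, so with N workers I get N full copies:

```python
import ray
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class FullModelPredictor:
    def __init__(self):
        # The *entire* model is loaded onto this worker's single GPU,
        # so every actor in the pool holds its own full replica.
        self.tokenizer = AutoTokenizer.from_pretrained("my-org/my-custom-model")
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = AutoModelForCausalLM.from_pretrained(
            "my-org/my-custom-model", torch_dtype=torch.float16
        ).to("cuda")

    def __call__(self, batch: dict) -> dict:
        inputs = self.tokenizer(
            list(batch["text"]), return_tensors="pt", padding=True
        ).to("cuda")
        with torch.no_grad():
            out = self.model.generate(**inputs, max_new_tokens=64)
        batch["generated"] = self.tokenizer.batch_decode(out, skip_special_tokens=True)
        return batch


ds = ray.data.from_items([{"text": "hello world"}] * 1024)
ds = ds.map_batches(
    FullModelPredictor,
    concurrency=4,   # 4 actors -> 4 full copies of the model -> OOM for my model
    num_gpus=1,
    batch_size=32,
)
ds.show(1)
```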
Now, what is the best workflow to run batch inference for a custom model using tensor parallelism (or any other technique that avoids fitting the entire model on one GPU)?
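
For what it's worth, this is roughly the shape of what I mean by holding shards on separate ray actors: a toy column-parallel matmul where each actor keeps only its slice of a weight matrix on its own GPU (all names and sizes here are made up, not my actual model):

```python
import numpy as np
import ray
import torch


@ray.remote(num_gpus=1)
class ShardWorker:
    """Holds only one column shard of a weight matrix on its own GPU."""

    def __init__(self, in_dim: int, shard_out_dim: int, seed: int):
        torch.manual_seed(seed)
        self.weight_shard = torch.randn(in_dim, shard_out_dim, device="cuda")

    def forward(self, x: np.ndarray) -> np.ndarray:
        x_gpu = torch.as_tensor(x, dtype=torch.float32, device="cuda")
        # Each worker computes only its slice of the output features.
        return (x_gpu @ self.weight_shard).cpu().numpy()


ray.init()
in_dim, out_dim, num_shards = 1024, 4096, 2
workers = [
    ShardWorker.remote(in_dim, out_dim // num_shards, seed=i)
    for i in range(num_shards)
]

batch = np.random.rand(8, in_dim).astype(np.float32)
# Broadcast the batch, gather the partial outputs, concatenate along features.
partials = ray.get([w.forward.remote(batch) for w in workers])
full_output = np.concatenate(partials, axis=1)  # shape (8, out_dim)
print(full_output.shape)
```

But I am not sure how to extend this to a full model with many layers, cross-shard communication, and ray Data driving the batches, hence the question.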