Good day!
I have an Nvidia and an AMD GPU (RTX 4080 and RX 7900 XTX). I'd like to split one quantized model (70B parameters) across the two cards to improve text-generation speed.
The only solution I found was to use Vulkan with llama.cpp, but it is still quite slow (probably due to limitations of the Vulkan backend). Are there alternative ways to split one LLM across GPUs from different vendors?
UPD:
What have I tried?
I compiled llama.cpp with the -DLLAMA_VULKAN=1 flag on Ubuntu 22.04 with the Vulkan SDK installed. Generation with Llama 3 70B (Q3 and Q4 quantizations) on the two GPUs was significantly slower (roughly 4 times) than running the same model through a CUDA-only setup (for example, text-generation-webui with the CUDA build of llama.cpp), even though the Q3-quantized model fits entirely in the combined VRAM of the two adapters.
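For reference, this is roughly how I load the model, shown here as a minimal sketch using llama-cpp-python (assuming it is built against the same Vulkan backend as the CMake build above); the model file name and split ratios are illustrative placeholders, not tested values:

```python
# Sketch: loading a quantized 70B GGUF split across two GPUs via llama-cpp-python.
# Assumes the package was built with the Vulkan backend, mirroring -DLLAMA_VULKAN=1.
from llama_cpp import Llama

llm = Llama(
    model_path="./Meta-Llama-3-70B-Instruct.Q3_K_M.gguf",  # hypothetical file name
    n_gpu_layers=-1,            # offload all layers to the GPUs
    tensor_split=[0.6, 0.4],    # rough VRAM ratio: 24 GB (7900 XTX) vs 16 GB (4080);
                                # actual order depends on Vulkan device enumeration
    n_ctx=4096,
)

out = llm("Explain tensor parallelism in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```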
Is there a faster way to run text generation on two GPUs from different manufacturers without using the Vulkan SDK?