I have built a document question-answering system using a LLaMA-2 8-bit quantized model. I recently migrated the project from the old system to a new one with two NVIDIA RTX A6000 GPUs (48 GB each).
When I run the model, it gets split into two parts and loaded onto separate GPUs, which increases the response time.
How can I mitigate this, or is there a way to make the model run on a single GPU? Please suggest.
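
Would pinning the device map to one GPU be the right approach? Here is a minimal sketch of what I am considering, assuming the model is loaded through Hugging Face transformers with bitsandbytes 8-bit quantization (the model ID below is a placeholder for my actual checkpoint):

```python
# Minimal sketch: load the 8-bit model onto GPU 0 only, instead of
# letting device_map="auto" shard it across both A6000s.
# Assumes the Hugging Face transformers + bitsandbytes stack;
# the model ID is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map={"": 0},  # pin every layer to cuda:0 instead of "auto"
)
```

Or would hiding the second card entirely, e.g. by setting `CUDA_VISIBLE_DEVICES=0` before launching the process, be the cleaner fix?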