How to load an 8-bit quantized LLaMA-2 model onto a single GPU
I have built a document question-answering system using an 8-bit quantized LLaMA-2 model. I recently migrated the project from the old machine to a new one with 2x Nvidia RTX A6000 GPUs (48 GB each).
When I run the model, it gets split into two parts that load onto the separate GPUs, which increases the response time. How can I force the whole model to load onto a single GPU?
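For reference, this is roughly how I load the model (a minimal sketch, with the model id as a placeholder for my actual checkpoint). I believe the `device_map="auto"` setting is what lets accelerate shard the weights across both GPUs:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # placeholder for my actual model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit quantization via bitsandbytes
    device_map="auto",  # <- this appears to split the model across both A6000s
)
```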