I am trying to load Llama-2-13b across multiple GPUs, but it isn't loading onto them. I have 3 GPUs with 24.169 GB each, and I have tried both moving the model to cuda and device_map='auto'.
This is my current code. When I run nvidia-smi in a terminal, the GPUs always sit at 0% utilization. When I remove the split options it works, but then everything runs on the CPU.
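By "tried using cuda" I mean loading the model directly onto a single GPU, roughly like this (variable name here is just for illustration; same model id and access token as in the script below):

model_single = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float16,
    token=access_token,
).to("cuda")  # 13B in fp16 is roughly 26 GB of weights, more than one 24 GB card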
Here is my full current attempt with device_map='auto':
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
device_map = 'auto'
# Load the tokenizer and model from Hugging Face hub
access_token = token  # my Hugging Face access token (placeholder)
model = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model, token=access_token)
documents = """
{
"abovegradefinishedarea": 1215.0,
"bathroomsfull": 2,
"bathroomstotalinteger": 2,
"bedroomstotal": 3,
"yearbuilt": 1909,
"city": "Minneapolis",
"closedate": "2022-09-23",
}
"""
question = 'what is the name of the city?\n'
prompt = f"""
<<SYS>>
Only respond with "Not in the text." if the information needed to answer the question is not contained in the document. \n
Answer the question using only the information provided in the document below. \n
Ensure that the questions are answered fully and effectively. \n
Respond in short and concise yet fully formulated sentences, being precise and accurate
<</SYS>>
[INST]
User:{question}
[/INST]
[INST]
User:{documents}
[/INST]\n
Assistant:
"""
llama_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,   # reuse the tokenizer loaded above
    token=access_token,    # the Llama-2 repo is gated, so the pipeline needs the token too
    torch_dtype=torch.float16,
    device_map=device_map,  # let accelerate spread the weights over the 3 GPUs
    temperature=0.1,
)
sequences = llama_pipeline(
    prompt,
    do_sample=True,
    top_k=50,
    num_return_sequences=2,
    max_new_tokens=2048,
    return_full_text=False,
    temperature=0.1,
)
print("Chatbot:", sequences[0]['generated_text'])