I am trying to fine-tune Llama 2 on a dataset I created myself. The texts in the dataset are quite long, so I initialize the model with rope_scaling (linear, factor 8). I also use 4-bit quantization and QLoRA to save memory. However, after creating the model and tokenizer, about 30 GB of GPU memory is already in use, and I cannot train even with a batch size of 1. When I instead load some published models from Hugging Face, everything is fine: they consume at most ~10 GB of GPU memory, as expected.
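For context, the 4-bit / QLoRA part of my setup follows the usual bitsandbytes + peft recipe, roughly like this (a sketch; the concrete values for r, lora_alpha, target_modules, etc. are placeholders and not what I suspect is wrong):

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization, meant to be passed to from_pretrained via quantization_config
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)

# QLoRA adapters, applied later with get_peft_model (hyperparameters are placeholders)
lora_config = LoraConfig(r=16,
                         lora_alpha=32,
                         target_modules=["q_proj", "v_proj"],
                         lora_dropout=0.05,
                         task_type="CAUSAL_LM")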
The code that takes 30 GB of memory is the following:
from transformers import LlamaConfig, LlamaForCausalLM

model_config = LlamaConfig(max_position_embeddings=4096,
                           rope_scaling={"type": "linear", "factor": 8.0},
                           use_cache=False)
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                         config=model_config,
                                         device_map="auto",
                                         trust_remote_code=True)
This is the memory situation right after running that code:
CPU RAM Free: 1.0 TB, GPU 0 … Mem Free: 19319MB / 46068MB | Utilization 57%
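For what it's worth, a minimal way to double-check the allocation directly from PyTorch (assuming the weights ended up on GPU 0 via device_map="auto") gives roughly the same numbers:

import torch

# Rough view of what the loaded model occupies on GPU 0
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.1f} GiB")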