I’ve tried loading Hugging Face Transformers models onto MPS in two different ways:
import torch
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="mps",
    token=True,
)
and
llm = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16,
    token=True,
)
llm = llm.to("mps")
Strangely, Activity Monitor shows ~16 GB of memory usage for the first approach, but ~20 GB for the second. I am wondering why this discrepancy exists.
When I call torch.mps.current_allocated_memory(), both methods report the same value (~16 GB allocated on MPS). So I suspect that, for some reason, memory is being kept on the CPU side in the second method. I tried triggering garbage collection manually, but this did not fix the issue.
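For concreteness, the manual collection attempt looked roughly like this. The `torch.mps.empty_cache()` call is an extra guess at asking the MPS caching allocator to release unused blocks back to the OS; it is not something I have confirmed helps with this discrepancy, and it is guarded so the snippet still runs on machines without PyTorch/MPS:

```python
import gc

# Run after llm = llm.to("mps") from the second snippet above.
# This is the manual collection I tried; Activity Monitor still
# showed ~20 GB afterwards.
collected = gc.collect()
print(f"gc.collect() reclaimed {collected} unreachable objects")

# Additional guess (not confirmed to help): ask the MPS caching
# allocator to return cached, unused blocks to the OS.
try:
    import torch
    if torch.backends.mps.is_available():
        torch.mps.empty_cache()
except ImportError:
    pass  # snippet stays runnable without PyTorch installed
```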
Does anyone know what causes this discrepancy in memory usage (and if there is a way to remove it)?
Thank you!