I was measuring the RAM used by my script and was surprised that loading the tokenizer takes about 300 MB of RAM, while the tokenizer file itself is only about 9 MB. Why is that?
I tried:
from transformers import AutoTokenizer
from memory_profiler import profile

@profile
def load_tokenizer():
    path = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    tokenizer = AutoTokenizer.from_pretrained(path)


    return tokenizer

load_tokenizer()
Output:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     4    377.4 MiB    377.4 MiB           1   @profile
     5                                         def load_tokenizer():
     6    377.4 MiB      0.0 MiB           1       path = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
     7    676.6 MiB    299.2 MiB           1       tokenizer = AutoTokenizer.from_pretrained(path)
     8
     9
    10    676.6 MiB      0.0 MiB           1       return tokenizer
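To narrow down where the memory goes, one comparison I can think of is loading the slow, pure-Python tokenizer instead of the default Rust-backed fast one, via the documented use_fast flag of from_pretrained. A minimal sketch (the function names here are only mine, chosen so the two loads show up separately in the profiler output):

from transformers import AutoTokenizer
from memory_profiler import profile

path = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

@profile
def load_fast_tokenizer():
    # Default: the Rust-backed "fast" tokenizer from the tokenizers library.
    return AutoTokenizer.from_pretrained(path)

@profile
def load_slow_tokenizer():
    # use_fast=False selects the pure-Python implementation, which keeps
    # its vocabulary tables as ordinary Python objects.
    return AutoTokenizer.from_pretrained(path, use_fast=False)

load_fast_tokenizer()
load_slow_tokenizer()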