What is the purpose of add_prefix_space, and how do I know which models require it?
from transformers import AutoTokenizer

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)
I searched the Hugging Face docs, but it is still not clear what this option does.
- class tokenizers.pre_tokenizers.ByteLevel
add_prefix_space (bool, optional, defaults to True) — Whether to add a space to the first word if there isn’t already one. This lets us treat hello exactly like say hello.
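To see this pre-tokenizer in action, here is a minimal sketch (assuming the tokenizers library is installed); it shows how add_prefix_space changes the treatment of the first word:

from tokenizers.pre_tokenizers import ByteLevel

# With add_prefix_space=True, a space is added before the first word,
# so 'hello' is pre-tokenized the same way as the 'hello' in 'say hello'.
print(ByteLevel(add_prefix_space=True).pre_tokenize_str('hello world'))
# both words come back with the leading-space marker, e.g. [('Ġhello', ...), ('Ġworld', ...)]

print(ByteLevel(add_prefix_space=False).pre_tokenize_str('hello world'))
# the first word has no marker, e.g. [('hello', ...), ('Ġworld', ...)]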
It is the add_prefix_space=True option for the BPE tokenizer, explained in the thread "BPE tokenizers and spaces before words":
In the GPT-2 and RoBERTa tokenizers, the space before a word is part of the word, i.e. “Hello how are you puppetter” will be tokenized as [“Hello”, “Ġhow”, “Ġare”, “Ġyou”, “Ġpuppet”, “ter”]. You can see the spaces included in the words as a Ġ here. Spaces are converted into a special character (the Ġ) in the tokenizer prior to BPE splitting, mostly to avoid the algorithm digesting raw spaces, since the standard BPE algorithm uses spaces in its process (this can seem a bit hacky, but it was in the original GPT-2 tokenizer implementation by OpenAI).
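As a quick check of the behaviour described above, here is a small sketch with the GPT-2 tokenizer from transformers (the 'gpt2' checkpoint is used here for illustration):

from transformers import AutoTokenizer

# Default: no prefix space, so the first word is tokenized without Ġ.
tokenizer = AutoTokenizer.from_pretrained('gpt2')
print(tokenizer.tokenize('Hello how are you puppetter'))
# ['Hello', 'Ġhow', 'Ġare', 'Ġyou', 'Ġpuppet', 'ter']

# With add_prefix_space=True, a space is prepended, so the first word
# is tokenized as if it appeared mid-sentence.
tokenizer = AutoTokenizer.from_pretrained('gpt2', add_prefix_space=True)
print(tokenizer.tokenize('Hello how are you puppetter'))
# expected: ['ĠHello', 'Ġhow', 'Ġare', 'Ġyou', 'Ġpuppet', 'ter']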
The code was taken from Fine-Tuning Large Language Models (LLMs).