The Whisper tokenizer does not work "as expected" for the Armenian language: instead of getting just a few tokens (each covering several letters), I got more than 20 tokens (each containing just 1-2 characters).
Text sample:
- English: “I wanna dance”
- Armenian: “Ես ուզում եմ պարել”
Note: the Armenian language is officially supported by Whisper.
Code sample:
from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")
tokens_en = tokenizer.tokenize("I wanna dance")
print(tokens_en)
# ['I', 'Ġwanna', 'Ġdance']
tokens_hy = tokenizer.tokenize("Ես ուզում եմ պարել")
print(tokens_hy)
# ['Ô', 'µ', 'Õ', '½', 'ĠÕ', '¸', 'ÖĤ', 'Õ', '¦', 'Õ¸ÖĤ', 'Õ', '´', 'ĠÕ', '¥', 'Õ', '´', 'ĠÕ', 'º', 'Õ¡', 'ÖĢ', 'Õ¥Õ', '¬']
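I suspect the odd characters come from the byte-level BPE pre-mapping (Whisper's tokenizer is based on the GPT-2 byte-level scheme, where each UTF-8 byte is mapped to a printable stand-in character, so a multi-byte Armenian letter appears as several strange glyphs). Here is a minimal sketch reimplementing that mapping to check; the `bytes_to_unicode` function below is my own reconstruction of the GPT-2 scheme, not an import from `transformers`:

```python
def bytes_to_unicode():
    """Map every byte (0-255) to a printable unicode character,
    following the GPT-2 byte-level BPE scheme."""
    # Bytes that are already printable map to themselves.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (control chars, space, etc.) get shifted up past 255.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()

# The Armenian letter "Ե" (U+0535) is two UTF-8 bytes: 0xD4 0xB5.
mapped = "".join(byte_encoder[b] for b in "Ես".encode("utf-8"))
print(mapped)  # ÔµÕ½ -- matching the first four "tokens" in the output above
```

So the characters themselves are just a byte-level encoding artifact, but that still leaves my actual problem: each Armenian letter is being split into its individual bytes instead of being merged into larger subword units.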
I tried both WhisperTokenizer and WhisperTokenizerFast, but got the same result.
So, the question is: how can I train the WhisperTokenizer on my own Armenian text corpus to improve tokenization?