The Whisper tokenizer does not work as expected for the Armenian language: instead of getting just a few multi-letter tokens, I got more than 20 tokens of only 1-2 characters each.
Text sample
English: "I wanna dance"
Armenian: "Ես ուզում եմ պարել"
Note: Armenian is officially supported by Whisper.
Code sample
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-base")

tokens_en = tokenizer.tokenize("I wanna dance")
print(tokens_en)
# ['I', 'Ġwanna', 'Ġdance']

tokens_hy = tokenizer.tokenize("Ես ուզում եմ պարել")
print(tokens_hy)
# ['Ô', 'µ', 'Õ', '½', 'ĠÕ', '¸', 'ÖĤ', 'Õ', '¦', 'Õ¸ÖĤ', 'Õ', '´', 'ĠÕ', '¥', 'Õ', '´', 'ĠÕ', 'º', 'Õ¡', 'ÖĢ', 'Õ¥Õ', '¬']
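
For what it's worth, the Armenian pieces do decode back to the original string, so this looks like byte-level BPE splitting each character into raw bytes rather than any corruption (a quick check, reusing tokens_hy from above):

print(tokenizer.convert_tokens_to_string(tokens_hy))
# Ես ուզում եմ պարել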
I tried both WhisperTokenizer and WhisperTokenizerFast, but got the same result.
So, my question is: how can I train the WhisperTokenizer on my own Armenian text corpus to improve tokenization?
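
For context, the approach I was considering is a sketch like the one below, using train_new_from_iterator (the method PreTrainedTokenizerFast, and therefore WhisperTokenizerFast, exposes for retraining on a new corpus). Here armenian_corpus.txt, the batch size, and the vocab size are placeholders for my own setup:

from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-base")

# Placeholder: stream my Armenian corpus as batches of lines,
# which is the input format train_new_from_iterator expects.
def corpus_iterator(path="armenian_corpus.txt", batch_size=1000):
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Retrain the underlying BPE model on the new corpus;
# keeping the original vocab size is just a guess on my part.
new_tokenizer = tokenizer.train_new_from_iterator(
    corpus_iterator(), vocab_size=tokenizer.vocab_size
)

print(new_tokenizer.tokenize("Ես ուզում եմ պարել"))

My doubt with this is that a retrained tokenizer produces new token IDs that no longer line up with the pretrained Whisper model's embeddings, so I'm not sure it is the right way to go.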
Thank you.