I am training my own tokenizer based on bert-base-cased. The problem is that my data (from a dead language) contains tokens that begin with =,
and this character should not be split off from the rest of the token. How do I achieve that?
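
To illustrate what I mean, here is a minimal sketch in plain Python. It assumes that BERT-style pre-tokenization isolates punctuation characters such as `=`. The regex only imitates that behaviour and is not the library's actual implementation, and `=abc` is a made-up token standing in for my real data.

```python
import re

# Hypothetical sample text; "=abc" stands in for a real token from my corpus.
text = "foo =abc bar"

# Rough imitation of BERT-style pre-tokenization: each punctuation
# character becomes its own piece (an assumption, not the library's code).
bert_like = re.findall(r"\w+|[^\w\s]", text)

# What I would like instead: split on whitespace only, so "=" stays attached.
whitespace_only = text.split()

print(bert_like)        # ['foo', '=', 'abc', 'bar']
print(whitespace_only)  # ['foo', '=abc', 'bar']
```

My guess is that with the Hugging Face `tokenizers` library this means swapping the pre-tokenizer for a whitespace-only one such as `pre_tokenizers.WhitespaceSplit()`, but I have not confirmed that this is the right approach.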
Thanks for your help!