I am trying to fine-tune the Llama 2 language model on a dataset I created in Persian. But when I tokenize the dataset, I noticed that the Llama 2 tokenizer splits the Persian text at the character level rather than the word level, so the resulting tokens carry no meaning for training.
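As far as I can tell, this happens because Llama 2's SentencePiece tokenizer has a largely English 32k vocabulary and uses byte fallback, so characters it has no subword for are emitted as one token per UTF-8 byte. A minimal sketch of why this inflates Persian text (pure Python, no tokenizer download; the word and counts are just an example):

```python
# Assumption: each Persian character not covered by the vocabulary falls
# back to byte tokens, one per UTF-8 byte.
text = "سلام"  # Persian for "hello": 4 characters
raw = text.encode("utf-8")
print(len(text))  # 4 characters
print(len(raw))   # 8 UTF-8 bytes -> up to ~8 byte-fallback tokens
```

Since Persian characters are two bytes each in UTF-8, a four-character word can cost around eight tokens instead of one.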
This happens every time I run the trainer, and I don't know how to solve it. Is there a recommended language model or tokenizer for tokenizing and fine-tuning on a Persian dataset?