I am seriously stuck on a problem with single-word translation using a MarianMT model from HuggingFace. I’m currently developing a Telegram bot for translation, and I chose MarianMT for it. As training data I chose the well-known Europarl parallel corpus, which covers many languages and is written in a formal style.
Here is what I’ve done and which issues I’ve run into:
First, the technologies I have used. The programming language is Python 3 with the PyTorch deep learning framework. The model, as mentioned above, is MarianMT. I use different MarianMT checkpoints to handle multiple languages such as French, English, and German.
Secondly, the problem itself:
When I use the English-German model, it either translates an input word incorrectly or doesn’t translate it at all, and I get the message ‘Sorry but the translation for this language is not supported yet’. However, if I type the same word in German, it is translated correctly. There is also a problem with named entities such as cities and countries. For instance, if I type in English:
City of Düsseldorf is the capital of the state of NRW
the model produces something like this:
City of485 ist die Hauptstadt des Bundesstaates Houston
which is clearly wrong.
It also fails on single words such as car, butter, Ukraine, Denmark, and playground, and on similar words for countries, cities, objects, and sometimes even actions.
Thirdly, the models and parameters I have used:
Helsinki-NLP/opus-mt-en-de for English-German and German-English translation
Helsinki-NLP/opus-mt-en-fr for English-French translation
Helsinki-NLP/opus-mt-fr-en for French-English translation
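For reference, a minimal sketch of how checkpoints like these can be loaded and queried with a single word, assuming the HuggingFace transformers library. The names CHECKPOINTS, pick_checkpoint and translate are illustrative and are not taken from the bot itself; the mapping simply mirrors the list above.

```python
from typing import Dict, List, Tuple

# (source, target) -> checkpoint, mirroring the list above.
# Assumption worth double-checking: OPUS-MT checkpoint names encode the
# direction as source-target, and a separate Helsinki-NLP/opus-mt-de-en
# checkpoint exists for German-English.
CHECKPOINTS: Dict[Tuple[str, str], str] = {
    ("en", "de"): "Helsinki-NLP/opus-mt-en-de",
    ("de", "en"): "Helsinki-NLP/opus-mt-en-de",
    ("en", "fr"): "Helsinki-NLP/opus-mt-en-fr",
    ("fr", "en"): "Helsinki-NLP/opus-mt-fr-en",
}


def pick_checkpoint(src: str, tgt: str) -> str:
    """Return the checkpoint name for a language pair, or raise ValueError."""
    try:
        return CHECKPOINTS[(src, tgt)]
    except KeyError:
        raise ValueError(f"No checkpoint configured for {src}->{tgt}") from None


def translate(texts: List[str], src: str, tgt: str) -> List[str]:
    """Translate a batch of strings with the checkpoint for (src, tgt)."""
    # Imported lazily so the mapping above can be used without transformers.
    from transformers import MarianMTModel, MarianTokenizer

    name = pick_checkpoint(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling translate(["butter"], "en", "de") with the untouched pre-trained checkpoint and again with the fine-tuned weights would show whether the single-word failures are already present before training on Europarl.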
Now about the parameters for training:
Framework: PyTorch latest + Google Colab Pro
Programming Language: Python 3.10
Dataset: Europarl parallel corpus
Number of epochs: 2
Loss function: Sparse Categorical CrossEntropy
Optimizer: Adam
Learning rate: 0.0001
Batch size: 32
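The parameters above correspond roughly to the following sketch of a single fine-tuning step, assuming PyTorch and the transformers library. TrainConfig, build and train_step are hypothetical names, and this is not the exact training script. The loss here is the token-level cross-entropy that MarianMTModel computes itself when labels are passed, which is assumed to be what "Sparse Categorical CrossEntropy" refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the parameter list above.
    checkpoint: str = "Helsinki-NLP/opus-mt-en-de"
    epochs: int = 2
    learning_rate: float = 1e-4
    batch_size: int = 32


def build(cfg: TrainConfig):
    """Load model, tokenizer and an Adam optimizer for the given config."""
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(cfg.checkpoint)
    model = MarianMTModel.from_pretrained(cfg.checkpoint)
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg.learning_rate)
    return model, tokenizer, optimizer


def train_step(model, tokenizer, optimizer, src_texts, tgt_texts) -> float:
    """Run one optimization step on a batch of parallel sentences."""
    model.train()
    batch = tokenizer(
        src_texts,
        text_target=tgt_texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    # Mask padding in the labels so it does not contribute to the loss.
    labels = batch["labels"]
    labels[labels == tokenizer.pad_token_id] = -100
    outputs = model(**batch)  # cross-entropy loss is computed from the labels
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

A full run would call train_step over batches of cfg.batch_size sentence pairs for cfg.epochs passes over the corpus.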
So how can I fix the problem that some words and sentences are either not translated at all or translated incorrectly?