I need to share my thoughts about efficient translation of po files using LLMs…
Maybe others have already had the same idea and taken it further?
I am looking for a way to translate po files using an already-translated version as context, to help disambiguate translations.
For example, if you provide a po file with
msgid "Bank"
msgstr ""
No tool, no matter how intelligent, is ever going to know if it’s talking about a financial company or the banks of a river.
But if you provide a French translation:
msgid "Bank"
msgstr "Banque"
then any translation-capable LLM should be able to translate that po file entry into Spanish, German, or whatever other language it knows.
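To make it concrete: with “Banque” as context, the expected Spanish output would be the financial sense (“Banco”) rather than the riverbank (“orilla”), so the resulting entry should look like:
msgid "Bank"
msgstr "Banco"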
Did anybody ever think about this?
Actually, with the help of Claude and ChatGPT (free versions 😉), I tried this idea myself and ended up with this pretty simple piece of Python code that:
- loads an LLM specialized in translation (facebook/mbart-large-50-many-to-many-mmt),
- reads the po file entry by entry
- tries to provide a translation of the English msgid using the French translation as context:
import polib
from transformers import pipeline

def translate_po_file(input_file, output_file):
    # Load the multilingual translation model
    translator = pipeline("translation", model="facebook/mbart-large-50-many-to-many-mmt")

    # Load the input .po file
    po = polib.pofile(input_file)

    # Walk through each entry and translate it
    for entry in po:
        if entry.msgid and not entry.fuzzy:
            # Prepare the context and the text to be translated
            context = entry.msgstr if entry.msgstr else entry.msgid
            text_to_translate = entry.msgid

            # Build the prompt
            prompt = f"Translate to Spanish. Context: {context}\nText: {text_to_translate}"

            # Translate into Spanish
            translation = translator(prompt, src_lang="en_XX", tgt_lang="es_XX")[0]['translation_text']

            # Extract the translated part (after "Text: ")
            translation = translation.split("Text: ")[-1].strip()

            # Update the translation
            entry.msgstr = translation

    # Save the new .po file
    po.save(output_file)

# Example of use
input_file = "input.po"
output_file = "output.es.po"
translate_po_file(input_file, output_file)
You’ll need to `pip install` several libraries before running this code (I’ve gone through several tests, so I’m not sure they’re all still needed):
polib
transformers
torch
sentencepiece
sacremoses
protobuf
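For reference, installing everything in one go should look like this (adjust to your own environment):
pip install polib transformers torch sentencepiece sacremoses protobuf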
Unfortunately, this does not work very well.
I think I still have to:
- deal with the placeholders included in po files (e.g. {some_variable} or %s or %(some_variables)s …), see the sketch after this list,
- and probably provide a much better prompt to explain to the model how to use the context for translation…
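For the placeholder issue, here is a minimal sketch of what I have in mind: hide the placeholders behind numbered sentinel tokens before translating, then put them back afterwards. The regex only covers the {name}, %(name)s and %s/%d forms mentioned above, and the [[0]]-style tokens are an arbitrary choice that the model may still mangle:

import re

# Matches Python-style placeholders: %(name)s, {name}, and bare %s / %d
PLACEHOLDER_RE = re.compile(r"%\([^)]+\)[sd]|\{[^}]*\}|%[sd]")

def protect_placeholders(text):
    # Replace each placeholder with a numbered token the model should leave alone
    placeholders = []
    def store(match):
        placeholders.append(match.group(0))
        return f"[[{len(placeholders) - 1}]]"
    return PLACEHOLDER_RE.sub(store, text), placeholders

def restore_placeholders(text, placeholders):
    # Put the original placeholders back after translation
    for i, placeholder in enumerate(placeholders):
        text = text.replace(f"[[{i}]]", placeholder)
    return text

# Example:
protected, saved = protect_placeholders("Hello %(user)s, you have {count} new messages")
# protected == "Hello [[0]], you have [[1]] new messages"
# ... translate `protected` instead of the raw msgid ...
restored = restore_placeholders(protected, saved)

As for the prompt, I suspect the deeper problem is that mbart-large-50 is a plain translation model, not an instruction-following one, so it just translates the whole “Context: … Text: …” string instead of using the context as guidance; an instruction-tuned model would probably handle that part better.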
I had a look on GitHub, where there are a lot of “gpt for po” projects, but I didn’t find any that use an existing translation as disambiguation context. I think this is absolutely key for po files, where entries are very short (sometimes just a single word) and thus don’t provide enough context on their own for the translator to work properly…
Any feedback welcome…