I’m attempting to fine-tune the NLLB model "facebook/nllb-200-distilled-600M"
for a scientific translation task from English (eng_Latn) to German (deu_Latn). I followed the official fine-tuning guidelines from the NLLB authors.
Documentation: link
This is the code block that is giving the error:
DATA_CONFIG = "/content/sample_data/data_config.json"
OUTPUT_DIR = "/content/outputs"
MODEL_FOLDER = "/content/drive/MyDrive/Thesis/nllb-checkpoints"
DROP = 0.1
SRC = "eng_Latn"
TGT = "deu_Latn"
!python /content/fairseq/examples/nllb/modeling/train/train_script.py \
    cfg=nllb200_dense3.3B_finetune_on_fbseed \
    cfg/dataset=default \
    cfg.dataset.lang_pairs="$SRC-$TGT" \
    cfg.fairseq_root=$(pwd) \
    cfg.output_dir=$OUTPUT_DIR \
    cfg.dropout=$DROP \
    cfg.warmup=10 \
    cfg.finetune_from_model=$MODEL_FOLDER/checkpoint.pt
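For reference, my mental model of what these dot-path overrides do is roughly the following: Hydra composes them into a nested config. This is just a minimal OmegaConf sketch to illustrate that, not the actual NLLB config schema:

from omegaconf import OmegaConf

# Minimal sketch: Hydra-style dot-path overrides become nested config keys.
# This is only an illustration, not the real NLLB/fairseq config structure.
overrides = [
    "cfg.dataset.lang_pairs=eng_Latn-deu_Latn",
    "cfg.output_dir=/content/outputs",
    "cfg.dropout=0.1",
]
cfg = OmegaConf.from_dotlist(overrides)
print(OmegaConf.to_yaml(cfg))
# cfg:
#   dataset:
#     lang_pairs: eng_Latn-deu_Latn
#   output_dir: /content/outputs
#   dropout: 0.1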
This is the error:
/content/fairseq/examples/nllb/modeling/train/train_script.py:287: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_path="conf", config_name="base_config")
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
TRAINING DIR: /content/outputs
Error executing job with overrides: ['cfg=nllb200_dense3.3B_finetune_on_fbseed', 'cfg/dataset=default', 'cfg.dataset.lang_pairs=eng_Latn-deu_Latn', 'cfg.fairseq_root=/content', 'cfg.output_dir=/content/outputs', 'cfg.dropout=0.1', 'cfg.warmup=10', 'cfg.finetune_from_model=/content/drive/MyDrive/LASS_KG_Data/Thesis/nllb-checkpoints/checkpoint.pt']
Traceback (most recent call last):
File "/content/fairseq/examples/nllb/modeling/train/train_script.py", line 289, in main
train_module = TrainModule(config)
File "/content/fairseq/examples/nllb/modeling/train/train_script.py", line 122, in __init__
assert cluster_name in cfg.dataset.data_prefix
omegaconf.errors.ConfigAttributeError: Key 'data_prefix' is not in struct
full_key: cfg.dataset.data_prefix
object_type=dict
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
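For what it's worth, I can reproduce the same ConfigAttributeError with a tiny OmegaConf snippet, assuming the NLLB config is a struct-mode OmegaConf dict whose dataset node has no data_prefix key (this is just my reading of the traceback, not the actual config):

from omegaconf import OmegaConf

# Minimal repro sketch: accessing a key that is missing from a struct-mode
# OmegaConf node raises ConfigAttributeError, like in the traceback above.
cfg = OmegaConf.create({"dataset": {"lang_pairs": "eng_Latn-deu_Latn"}})
OmegaConf.set_struct(cfg, True)

cluster_name = "local"  # placeholder; I don't know the real cluster name
try:
    assert cluster_name in cfg.dataset.data_prefix
except Exception as e:
    print(type(e).__name__, "-", e)
# ConfigAttributeError - Key 'data_prefix' is not in struct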
So far, I understand there is a missing data_prefix configuration. I created a demo custom data_config.json, which looks like this:
{
  "data_prefix": "/content/sample_data",
  "train_data": "train_demo.json",
  "test_data": "test_demo.json",
  "lang_pairs": "eng_Latn-deu_Latn"
}
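From the failing assertion (assert cluster_name in cfg.dataset.data_prefix) my guess, and it is only a guess, is that data_prefix is expected to be a mapping from a cluster name to a data path rather than a flat string. Something like the sketch below, where "local" is just a placeholder key I made up:

import json

# Hypothetical shape for data_prefix, inferred from the assertion above:
# a mapping {cluster_name: path} instead of a plain string.
candidate = {
    "data_prefix": {"local": "/content/sample_data"},  # "local" is a guess
    "train_data": "train_demo.json",
    "test_data": "test_demo.json",
    "lang_pairs": "eng_Latn-deu_Latn",
}
print(json.dumps(candidate, indent=2))

cluster_name = "local"
assert cluster_name in candidate["data_prefix"]  # passes with this shape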
While the official documentation provides some information, I’m encountering difficulties in applying it to my specific use case. Can someone share a detailed guide or point me to helpful resources on fine-tuning NLLB?