Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation
August 21, 2024
Authors: Felermino D. M. Antonio Ali, Henrique Lopes Cardoso, Rui Sousa-Silva
cs.AI
Abstract
As part of the Open Language Data Initiative shared tasks, we have expanded
the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely
spoken in Mozambique. We translated the dev and devtest sets from Portuguese
into Emakhuwa, and we detail the translation process and quality assurance
measures used. Our methodology involved various quality checks, including
post-editing and adequacy assessments. The resulting datasets provide
multiple reference translations for each source sentence. We present baseline results from
training a Neural Machine Translation system and fine-tuning existing
multilingual translation models. Our findings suggest that spelling
inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline
models underperformed on this evaluation set, underscoring the necessity for
further research to enhance machine translation quality for Emakhuwa. The data
is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES.
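The abstract notes that each source sentence has multiple references and that spelling inconsistencies are a challenge. As a rough illustration of why multiple references can help in that setting, the sketch below scores a hypothesis against several references and keeps the best match. It is a minimal sketch under stated assumptions, not the evaluation used in the paper: the abstract does not name its metrics, `char_fscore` is a simplified character n-gram F-score written for this example (not the reference chrF implementation), and the sample strings are English placeholders rather than dataset content.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_fscore(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified character n-gram F-beta score averaged over orders 1..max_n.

    Hypothetical helper for illustration only; orders for which either
    string is too short to yield an n-gram are skipped.
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            b2 = beta ** 2
            scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


def multi_ref_score(hypothesis, references):
    """Score a hypothesis against every reference and keep the best match."""
    return max(char_fscore(hypothesis, ref) for ref in references)


if __name__ == "__main__":
    # Placeholder strings standing in for two spelling variants of one reference.
    references = ["the colour of the house", "the color of the house"]
    hypothesis = "the color of the house"
    print(multi_ref_score(hypothesis, references))
```

With a single reference, a hypothesis that follows a different but acceptable spelling is penalized; taking the maximum over several references lets either variant count as a full match. Established toolkits combine multiple references differently, so scores from this sketch are not comparable to published numbers.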