Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation

Abstract

As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we describe in detail the translation process and the quality assurance measures used. Our methodology involved several quality checks, including post-editing and adequacy assessments. The resulting datasets contain multiple reference sentences for each source sentence. We present baseline results from training a Neural Machine Translation system and from fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the need for further research to improve machine translation quality for Emakhuwa. The data is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES.
