Expanding the FLORES+ Benchmark to More Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation

Abstract

As part of the Open Language Data Initiative shared tasks, we expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we describe the translation process and the quality assurance measures in detail. Our methodology involved several quality checks, including post-editing and adequacy assessments. The resulting datasets contain multiple reference sentences for each source sentence. We present baseline results from training a Neural Machine Translation system and from fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models performed poorly on this evaluation set, underscoring the need for further research to improve machine translation quality for Emakhuwa. The data is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES.
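
Because the evaluation set provides multiple references per source sentence, a system output can be scored against each reference and credited with its best match, which reduces the penalty for legitimate spelling variants. The sketch below illustrates that idea only. It uses a simplified character trigram F1 as an assumed stand-in metric (it is not the official chrF, and it is not necessarily the metric used for the baselines), the function names are hypothetical, and the example strings are English placeholders rather than data from the corpus.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count the character n-grams of a string (whitespace is kept as-is)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_f1(hypothesis: str, reference: str, n: int = 3) -> float:
    """F1 overlap of character n-grams between one hypothesis and one reference.

    Simplified illustration: a single n-gram order, no smoothing.
    """
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(hypothesis: str, references: list[str], n: int = 3) -> float:
    """Score a hypothesis against several references and keep the best match."""
    return max(ngram_f1(hypothesis, ref, n) for ref in references)


# Placeholder strings (not corpus data): two spellings of the same word.
references = ["color", "colour"]
hypothesis = "colour"

single = ngram_f1(hypothesis, references[0])           # penalized by the spelling variant
best = multi_reference_score(hypothesis, references)   # matched by the second reference
print(f"single-reference: {single:.3f}, multi-reference: {best:.3f}")
```

With only the first reference, the hypothesis is penalized for a spelling difference; with both references available, it receives full credit. Taking the maximum over references is one possible design choice, and established metric implementations may combine references differently.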