Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation
August 21, 2024
Authors: Felermino D. M. Antonio Ali, Henrique Lopes Cardoso, Rui Sousa-Silva
cs.AI
Abstract
As part of the Open Language Data Initiative shared tasks, we have expanded
the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely
spoken in Mozambique. We translated the dev and devtest sets from Portuguese
into Emakhuwa, and we detail the translation process and quality assurance
measures used. Our methodology involved various quality checks, including
post-editing and adequacy assessments. The resulting datasets consist of
multiple reference sentences for each source. We present baseline results from
training a Neural Machine Translation system and fine-tuning existing
multilingual translation models. Our findings suggest that spelling
inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline
models underperformed on this evaluation set, underscoring the necessity for
further research to enhance machine translation quality for Emakhuwa. The data
is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES.