
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

May 23, 2025
Authors: Khalil Hennara, Muhammad Hreden, Mohamed Motaism Hamed, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
cs.AI

Abstract

We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large language models (LLMs) have shown impressive progress in natural language processing tasks, including machine translation, smaller models also hold promise. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, a result achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger, proprietary models such as GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.

May 27, 2025