Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model

Abstract

We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models are also attracting attention. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B, a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, thanks to an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger proprietary models such as GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.