Omnilingual MT: Machine Translation for 1,600 Languages

Abstract

High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared with the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. Even these numbers have been hard to evaluate because reliable benchmarks and metrics are lacking. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a large language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve cross-lingual transfer and come close to solving the "understanding" part of the puzzle in MT for the 1,600 evaluated languages. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and are freely available.
