Omnilingual MT: Machine Translation for 1,600 Languages
March 17, 2026
Authors: Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà
cs.AI
Abstract
High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side supported via cross-lingual transfer. Even these numbers have been hard to verify due to the lack of reliable benchmarks and metrics.
We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext.
We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings.

Moreover, our evaluation of English-to-1,600 translation further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve cross-lingual transfer, coming close to solving the "understanding" part of the MT puzzle for the 1,600 evaluated languages. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving toward omnilinguality and are freely available.