Omnilingual MT: Machine Translation for 1,600 Languages
March 17, 2026
Authors: Omnilingual MT Team, Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà
cs.AI
Abstract
High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported via cross-lingual transfer. Even these numbers have been hard to verify due to the lack of reliable benchmarks and metrics.
We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext.
We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600-language translation further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve cross-lingual transfer, coming close to solving the "understanding" part of the MT puzzle for all 1,600 evaluated languages. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards omnilinguality and are freely available.