Omnilingual MT: Machine Translation for 1,600 Languages

Abstract

High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and perhaps a few hundred more on the source side, supported through cross-lingual transfer. Even these numbers have been hard to evaluate because reliable benchmarks and metrics are lacking. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a large language model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, coming close to solving the "understanding" part of the puzzle in MT for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are freely available and are evolving dynamically towards Omnilinguality.
