Beyond English: Toward Inclusive and Scalable Multilingual Machine Translation with LLMs
November 10, 2025
Authors: Yingfeng Luo, Ziqiang Xu, Yuxuan Ouyang, Murun Yang, Dingyang Lin, Kaiyan Chang, Tong Zheng, Bei Li, Peinan Feng, Quan Du, Tong Xiao, Jingbo Zhu
cs.AI
Abstract
Large language models have significantly advanced Multilingual Machine Translation (MMT), yet broad language coverage, consistent translation quality, and English-centric bias remain open challenges. To address these challenges, we introduce LMT, a suite of Large-scale Multilingual Translation models centered on both Chinese and English, covering 60 languages and 234 translation directions. During development, we identify a previously overlooked phenomenon of directional degeneration, in which symmetric multi-way fine-tuning data overemphasize the reverse directions (X to En/Zh), leading to excessive many-to-one mappings and degraded translation quality. We propose Strategic Downsampling, a simple yet effective method that mitigates this degeneration. In addition, we design Parallel Multilingual Prompting (PMP), which leverages typologically related auxiliary languages to enhance cross-lingual transfer. Through rigorous data curation and refined adaptation strategies, LMT achieves state-of-the-art performance among models of comparable language coverage, with our 4B model (LMT-60-4B) surpassing the much larger Aya-101-13B and NLLB-54B by a substantial margin. We release LMT in four sizes (0.6B/1.7B/4B/8B) to catalyze future research and to provide strong baselines for inclusive, scalable, and high-quality MMT: \href{https://github.com/NiuTrans/LMT}{https://github.com/NiuTrans/LMT}.
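The directional degeneration described above arises when a symmetric multi-way corpus pairs every language with the pivot languages in both directions, so the X→En/Zh side accumulates many-to-one mappings. The abstract does not give the exact downsampling recipe, so the sketch below is a minimal illustration of the idea under one assumption: reverse-direction pairs are kept only at a fixed ratio (`reverse_ratio` is a hypothetical knob, not a parameter from the paper).

```python
import random

def strategic_downsample(pairs, reverse_ratio=0.3, seed=0):
    """Downsample reverse-direction (X -> En/Zh) examples in a symmetric
    multi-way fine-tuning set, so that many-to-one mappings into the
    pivot languages do not dominate training.

    Each pair is a dict with "src" and "tgt" language codes.
    `reverse_ratio` (illustrative, not from the paper) is the fraction
    of reverse-direction pairs to keep.
    """
    rng = random.Random(seed)
    # Forward and non-pivot directions are kept in full.
    forward = [p for p in pairs if p["tgt"] not in ("en", "zh")]
    # Reverse directions into the pivots are subsampled.
    reverse = [p for p in pairs if p["tgt"] in ("en", "zh")]
    kept = rng.sample(reverse, int(len(reverse) * reverse_ratio))
    return forward + kept
```

In practice the keep ratio would be tuned per language or per data source; this sketch only shows the shape of the intervention.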
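Parallel Multilingual Prompting supplies the model with parallel renderings of the source sentence in typologically related auxiliary languages to strengthen cross-lingual transfer. The abstract does not specify the prompt template, so the following is an illustrative sketch; the layout, the bracketed language tags, and the `aux_versions` argument are all assumptions.

```python
def pmp_prompt(src_text, src_lang, tgt_lang, aux_versions):
    """Build a PMP-style translation prompt.

    `aux_versions` maps a typologically related auxiliary language code
    to a parallel rendering of the source sentence. The template below
    is an illustrative assumption, not the paper's exact format.
    """
    lines = [
        f"Translate the {src_lang} sentence into {tgt_lang}.",
        "Parallel context in related languages:",
    ]
    for lang, text in aux_versions.items():
        lines.append(f"[{lang}] {text}")
    lines.append(f"[{src_lang}] {src_text}")
    lines.append(f"[{tgt_lang}]")  # model completes the translation here
    return "\n".join(lines)
```

For example, translating Catalan into English might use Spanish and Portuguese as auxiliary context: `pmp_prompt("Bon dia", "ca", "en", {"es": "Buenos días", "pt": "Bom dia"})`.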