

Hunyuan-MT Technical Report

September 5, 2025
Authors: Mao Zheng, Zheng Li, Bingxin Qu, Mingyang Song, Yang Du, Mingrui Sun, Di Wang
cs.AI

Abstract

In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places a special emphasis on translation between Mandarin and several ethnic minority languages as well as dialects. Furthermore, to address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow-thinking mode. This model integrates multiple outputs generated by the Hunyuan-MT-7B model under varying parameter settings, thereby achieving performance superior to that of conventional slow-thinking models based on Chain-of-Thought (CoT). The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most state-of-the-art (SOTA) large models, particularly on translation between Mandarin and minority languages as well as dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, encompassing high-resource languages such as Chinese, English, and Japanese, as well as low-resource languages including Czech, Marathi, Estonian, and Icelandic.
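
As a concrete illustration of the Chimera-style test-time aggregation described in the abstract, the sketch below generates several candidate translations with Hunyuan-MT-7B under varying sampling parameters and then fuses them in a second pass. This is a minimal sketch under stated assumptions, not the released pipeline: the Hugging Face model ID, the prompts, and the reuse of the base model for the fusion step are all illustrative choices. In the report's setup, the integration is performed by the separately trained Hunyuan-MT-Chimera-7B model rather than a prompt to the base model, but the loop has the same shape: diversify generation, then integrate.

```python
# Illustrative sketch of Chimera-style test-time aggregation (not the authors' code).
# Assumptions: model ID "tencent/Hunyuan-MT-7B", plain-text translation prompts,
# and using the base model itself for the fusion pass.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "tencent/Hunyuan-MT-7B"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def translate(source_text: str, temperature: float, top_p: float) -> str:
    """Produce one candidate translation under a specific sampling configuration."""
    prompt = (
        "Translate the following text from Chinese to English:\n"
        f"{source_text}\nTranslation:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
    )
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

def chimera_translate(
    source_text: str,
    settings=((0.3, 0.9), (0.7, 0.95), (1.0, 0.99)),  # (temperature, top_p) pairs
) -> str:
    """Generate candidates under varying parameters, then fuse them in a second pass."""
    candidates = [translate(source_text, t, p) for t, p in settings]
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    fusion_prompt = (
        "Given the source text and several candidate translations, "
        "produce a single refined translation.\n"
        f"Source: {source_text}\nCandidates:\n{numbered}\nRefined translation:"
    )
    inputs = tokenizer(fusion_prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
```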