Hunyuan-MT Technical Report

Abstract

In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places special emphasis on translation between Mandarin and several ethnic minority languages and dialects. Furthermore, to address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow-thinking mode. This model integrates multiple outputs generated by Hunyuan-MT-7B under varying parameter settings, thereby outperforming conventional slow-thinking models based on Chain-of-Thought (CoT). The development of our models follows a holistic training process engineered specifically for multilingual translation: it begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experiments, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most SOTA large models, particularly on translation between Mandarin and minority languages and dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, from high-resource languages such as Chinese, English, and Japanese to low-resource languages such as Czech, Marathi, Estonian, and Icelandic.
