R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning
February 27, 2025
Authors: Minggui He, Yilun Liu, Shimin Tao, Yuanchang Luo, Hongyong Zeng, Chang Su, Li Zhang, Hongxia Ma, Daimeng Wei, Weibin Meng, Hao Yang, Boxing Chen, Osamu Yoshie
cs.AI
Abstract
Despite recent breakthroughs in reasoning-enhanced large language models (LLMs) such as DeepSeek-R1, incorporating inference-time reasoning into machine translation (MT), a domain where human translators naturally employ structured, multi-layered chains of thought (CoTs), remains underexplored. Existing methods either design a fixed CoT tailored to a specific MT sub-task (e.g., literary translation), or rely on synthetic CoTs unaligned with human reasoning together with supervised fine-tuning (SFT) that is prone to catastrophic forgetting, limiting their adaptability to diverse translation scenarios. This paper introduces R1-Translator (R1-T1), a novel framework that achieves inference-time reasoning for general MT via reinforcement learning (RL) with human-aligned CoTs comprising six common patterns. Our approach pioneers three innovations: (1) extending reasoning-based translation beyond MT sub-tasks to six languages and diverse tasks (e.g., legal/medical domain adaptation, idiom resolution); (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies such as context-aware paraphrasing and back-translation; and (3) enabling self-evolving CoT discovery and anti-forgetting adaptation through RL with KL-constrained rewards. Experimental results show steady translation performance improvements across 21 languages and 80 translation directions on the Flores-101 test set, especially for the 15 languages unseen during training, while preserving general multilingual abilities compared with plain SFT.
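
The "KL-constrained rewards" in innovation (3) are not spelled out in the abstract; the sketch below is the standard KL-regularized RL objective used in RLHF-style training, given here as a plausible reading rather than the paper's exact objective:

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r(x, y) \,\big]
\;-\;
\beta\,
\mathbb{D}_{\mathrm{KL}}\!\big(
  \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x)
\big)
```

Here \(\pi_\theta\) is the policy being trained, \(\pi_{\mathrm{ref}}\) the frozen pre-RL model, \(r(x, y)\) a translation-quality reward, and \(\beta\) the penalty weight. The KL term bounds how far the fine-tuned policy can drift from the reference model, which is the mechanism behind the anti-forgetting claim.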
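As a concrete illustration, here is a minimal sketch of how such a KL-penalized reward is typically computed per sampled translation. All names are hypothetical and the task reward could be any translation-quality score; this is a standard RLHF-style construction, not the paper's implementation:

```python
import torch


def kl_constrained_reward(logprobs_policy: torch.Tensor,
                          logprobs_ref: torch.Tensor,
                          task_reward: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of a KL-penalized RL reward.

    logprobs_policy: log-probs of the sampled translation tokens under
        the policy being trained, shape (batch, seq_len).
    logprobs_ref: log-probs of the same tokens under the frozen
        reference (pre-RL) model, same shape.
    task_reward: scalar translation-quality reward per sample
        (e.g., a COMET or BLEU score), shape (batch,).
    """
    # Monte Carlo estimate of the per-token KL divergence between the
    # policy and the reference model on the sampled tokens.
    kl_per_token = logprobs_policy - logprobs_ref
    # Penalize total drift from the reference model; this is the term
    # that discourages catastrophic forgetting.
    return task_reward - beta * kl_per_token.sum(dim=-1)
```

Keeping beta above zero ties the fine-tuned policy to the reference model, which is how KL-constrained RL can preserve the general multilingual abilities that plain SFT tends to erode.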