ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning

Abstract

In recent years, large reasoning models (LRMs) such as OpenAI-o1 and DeepSeek-R1 have shown impressive capabilities on complex problems, e.g., mathematics and coding. Some pioneering studies attempt to bring the success of LRMs to neural machine translation (MT), building LRMs with deep reasoning MT ability via reinforcement learning (RL). Despite some progress, these attempts generally focus on a few high-resource languages, e.g., English and Chinese, leaving performance on other languages unclear. Moreover, the reward modeling methods in previous work do not fully unleash the potential of reinforcement learning in MT. In this work, we first design a new reward modeling method that compares the translations of the policy MT model with those of a strong LRM (i.e., DeepSeek-R1-671B) and quantifies the comparisons to provide rewards. Experimental results demonstrate the superiority of this reward modeling method. Using Qwen2.5-7B-Instruct as the backbone, the trained model achieves new state-of-the-art performance in literary translation and outperforms strong LRMs including OpenAI-o1 and DeepSeek-R1. Furthermore, we extend our method to a multilingual setting with 11 languages. With a carefully designed lightweight reward modeling in RL, we can simply transfer the strong MT ability from a single direction to multiple (i.e., 90) translation directions and achieve impressive multilingual MT performance.
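The abstract describes the reward only at a high level: the policy model's translation is compared against an exemplar from a strong LRM, and the comparison is quantified into a reward. A minimal sketch of that idea might look like the following. The verdict labels, the reward values, and the names `exemplar_reward`, `REWARD_BY_VERDICT`, and `stub_judge` are all assumptions made for illustration; the abstract does not specify how the comparison is scored, and the stub stands in for whatever evaluator actually performs the comparison.

```python
from typing import Callable, Dict

# A judge takes (source, candidate, exemplar) and returns a verdict label.
Judge = Callable[[str, str, str], str]

# Hypothetical verdict-to-reward mapping; the values are illustrative only
# and are not taken from the paper.
REWARD_BY_VERDICT: Dict[str, float] = {
    "better": 1.0,
    "comparable": 0.5,
    "worse": 0.0,
}


def exemplar_reward(source: str, candidate: str, exemplar: str, judge: Judge) -> float:
    """Turn a candidate-vs-exemplar comparison into a scalar reward."""
    verdict = judge(source, candidate, exemplar)
    if verdict not in REWARD_BY_VERDICT:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return REWARD_BY_VERDICT[verdict]


def stub_judge(source: str, candidate: str, exemplar: str) -> str:
    """Placeholder judge: an exact match counts as comparable, anything else as worse."""
    return "comparable" if candidate.strip() == exemplar.strip() else "worse"
```

In an RL loop, the value returned by `exemplar_reward` would be the signal used to update the policy for each sampled translation; a real judge would replace `stub_judge`.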
