

Mending the Holes: Mitigating Reward Hacking in Reinforcement Learning for Multilingual Translation

March 13, 2026
作者: Yifeng Liu, Siqi Ouyang, Yatish Hosmane Revanasiddappa, Lei Li
cs.AI

Abstract
Large Language Models (LLMs) have demonstrated remarkable capability in machine translation on high-resource language pairs, yet their performance on low-resource translation still lags behind. Existing post-training methods rely heavily on high-quality parallel data, which are often scarce or unavailable for low-resource languages. In this paper, we introduce WALAR, a reinforcement training method that uses only monolingual text to elevate LLMs' translation capabilities across a large number of low-resource languages while retaining their performance on high-resource languages. Our key insight is based on the observation of failure modes (or "holes") in existing source-based multilingual quality estimation (QE) models: reinforcement learning (RL) using these QE models as the reward tends to amplify such holes, resulting in poorer multilingual LLMs. We develop techniques including word alignment and language alignment to mitigate such holes in WALAR's reward for RL training. Using WALAR, we continually trained an LLM supporting translation across 101 languages. Experiments show that our new model outperforms LLaMAX, one of the strongest open-source multilingual LLMs, by a large margin across 1,400 language directions on the Flores-101 dataset.
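To make the reward-hacking problem concrete, the sketch below shows one hypothetical way a QE-based RL reward could be gated against two well-known "holes": outputs in the wrong language and verbatim source copying, both of which can fool a source-based QE model. All function names, heuristics, and weights here are illustrative assumptions, not WALAR's actual word-alignment or language-alignment components; a real system would use a learned QE metric and a proper language-identification model in place of the toy stand-ins.

```python
def toy_qe(source: str, hypothesis: str) -> float:
    """Stand-in for a source-based QE model, scored in [0, 1].
    Here: a toy length-ratio heuristic purely for illustration."""
    if not hypothesis:
        return 0.0
    shorter, longer = sorted((len(source), len(hypothesis)))
    return shorter / longer


def toy_lang_id(text: str) -> str:
    """Stand-in language identifier: 'en' for Latin-script text, 'other' otherwise.
    A real system would use a trained language-ID classifier."""
    letters = [c for c in text if c.isalpha()]
    if letters and all(ord(c) < 0x250 for c in letters):
        return "en"
    return "other"


def composite_reward(source: str, hypothesis: str, target_lang: str) -> float:
    """Gate the QE score with checks that close two common reward 'holes'."""
    # Hole 1: wrong-language output can still score well under some QE models,
    # so zero out the reward when the detected language misses the target.
    if toy_lang_id(hypothesis) != target_lang:
        return 0.0
    # Hole 2: copying the source verbatim can fool source-based QE, so
    # discount the reward by the fraction of hypothesis tokens copied from it.
    src_tokens = set(source.lower().split())
    hyp_tokens = hypothesis.lower().split()
    copy_rate = (sum(t in src_tokens for t in hyp_tokens) / len(hyp_tokens)
                 if hyp_tokens else 1.0)
    return toy_qe(source, hypothesis) * (1.0 - copy_rate)
```

Under this gating, a wrong-language output or a pure source copy earns zero reward, so an RL policy can no longer exploit those failure modes to inflate its QE score.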
March 21, 2026