RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

May 21, 2025
Authors: Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
cs.AI

Abstract

Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
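To make the interleaved training described in the abstract concrete, the sketch below shows one possible shape of such a loop: the generator is updated with the verifier's judgment as its reward, while the verifier is updated only on whether its overall verdict matched ground-truth correctness (an outcome-level verification reward). This is a minimal, hypothetical illustration, not the authors' implementation; all function names (`generate`, `verify`, `rl_update`) and the toy problems are placeholders, and a real system would apply a policy-gradient method (e.g., PPO/GRPO) to LLM policies.

```python
# Schematic of an interleaved generator/verifier RL loop (illustrative only).
import random

def generate(problem):
    """Hypothetical generator: returns (solution_steps, final_answer)."""
    steps = [f"step {i} for {problem}" for i in range(3)]
    answer = random.choice([42, 9])          # stand-in for the model's answer
    return steps, answer

def verify(steps, answer):
    """Hypothetical generative, process-level verifier: judges each step and
    returns per-step verdicts plus an overall correctness verdict."""
    step_verdicts = [random.random() > 0.3 for _ in steps]
    overall = all(step_verdicts) and random.random() > 0.2
    return step_verdicts, overall

def rl_update(model_name, reward):
    """Placeholder for a policy-gradient update (e.g., a PPO/GRPO step)."""
    print(f"  update {model_name} with reward {reward:+.1f}")

problems = [("2 + 40", 42), ("3 * 3", 9)]    # (problem, reference answer)

for epoch in range(2):
    print(f"epoch {epoch}")
    for problem, reference in problems:
        steps, answer = generate(problem)
        step_verdicts, overall = verify(steps, answer)

        # Generator reward: the verifier's (process-informed) judgment.
        generator_reward = 1.0 if overall else -1.0
        rl_update("generator", generator_reward)

        # Verifier reward: outcome-level verification correctness only --
        # did its overall verdict agree with ground-truth correctness?
        ground_truth_correct = (answer == reference)
        verifier_reward = 1.0 if overall == ground_truth_correct else -1.0
        rl_update("verifier", verifier_reward)
```

Note how, under this reading, no process-level labels are needed: the per-step verdicts shape the verifier's generative output, but its training signal comes solely from the outcome-level agreement check.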
