RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
May 21, 2025
Authors: Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
cs.AI
Abstract
Reinforcement learning (RL) has recently emerged as a compelling approach for
enhancing the reasoning capabilities of large language models (LLMs), where an
LLM generator serves as a policy guided by a verifier (reward model). However,
current RL post-training methods for LLMs typically use verifiers that are
fixed (rule-based or frozen pretrained) or trained discriminatively via
supervised fine-tuning (SFT). Such designs are susceptible to reward hacking
and generalize poorly beyond their training distributions. To overcome these
limitations, we propose Tango, a novel framework that uses RL to concurrently
train both an LLM generator and a verifier in an interleaved manner. A central
innovation of Tango is its generative, process-level LLM verifier, which is
trained via RL and co-evolves with the generator. Importantly, the verifier is
trained solely based on outcome-level verification correctness rewards without
requiring explicit process-level annotations. This generative RL-trained
verifier exhibits improved robustness and superior generalization compared to
deterministic or SFT-trained verifiers, fostering effective mutual
reinforcement with the generator. Extensive experiments demonstrate that both
components of Tango achieve state-of-the-art results among 7B/8B-scale models:
the generator attains best-in-class performance across five competition-level
math benchmarks and four challenging out-of-domain reasoning tasks, while the
verifier leads on the ProcessBench dataset. Remarkably, both components exhibit
particularly substantial improvements on the most difficult mathematical
reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
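For intuition, the interleaved scheme described in the abstract can be sketched as a loop in which the generator's reward comes from the verifier's process-level judgments, while the verifier is rewarded only for whether its overall verdict matches the outcome-level correctness of each sampled solution. The sketch below is purely illustrative and is not the authors' implementation: every name in it (Rollout, sample_solutions, verify, rl_update, train_step) is a hypothetical placeholder, and the stub functions stand in for a real LLM generator, a generative LLM verifier, and a policy-gradient update whose specific algorithm the abstract does not state.

```python
"""Illustrative sketch of an interleaved generator/verifier RL loop
(assumed structure, not the paper's code)."""

import random
from dataclasses import dataclass


@dataclass
class Rollout:
    problem: str
    steps: list[str]      # generator's reasoning steps
    answer: str
    is_correct: bool      # outcome label derived from the ground-truth answer


def sample_solutions(problem: str, gt_answer: str, n: int = 4) -> list[Rollout]:
    """Placeholder: sample n candidate solutions from the generator policy."""
    return [
        Rollout(problem, [f"step {i}" for i in range(3)], gt_answer, random.random() < 0.5)
        for _ in range(n)
    ]


def verify(rollout: Rollout) -> tuple[list[float], bool]:
    """Placeholder for the generative verifier: it scores each step and
    emits an overall correct/incorrect verdict for the solution."""
    step_scores = [random.random() for _ in rollout.steps]
    verdict = sum(step_scores) / len(step_scores) > 0.5
    return step_scores, verdict


def rl_update(model_name: str, rewards: list[float]) -> None:
    """Placeholder for a policy-gradient update on one model."""
    print(f"update {model_name}: mean reward = {sum(rewards) / len(rewards):.2f}")


def train_step(problems: list[tuple[str, str]]) -> None:
    gen_rewards, ver_rewards = [], []
    for problem, gt_answer in problems:
        for rollout in sample_solutions(problem, gt_answer):
            step_scores, verdict = verify(rollout)
            # Generator reward: the verifier's process-level feedback on its solution.
            gen_rewards.append(sum(step_scores) / len(step_scores))
            # Verifier reward: outcome-level verification correctness only --
            # whether its verdict agrees with the ground-truth correctness label,
            # so no per-step annotations are required.
            ver_rewards.append(1.0 if verdict == rollout.is_correct else 0.0)
    # Interleaved updates: the two models co-evolve over training.
    rl_update("generator", gen_rewards)
    rl_update("verifier", ver_rewards)


if __name__ == "__main__":
    train_step([("Compute 3 + 4 * 2.", "11")])
```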