RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Abstract

Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
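
The two mechanisms named in the abstract, interleaved updates and an outcome-level verification correctness reward, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Verification`, `verifier_outcome_reward`, and `interleaved_schedule`, the block sizes, and the 0/1 reward values are all assumptions made for illustration, since the abstract does not specify them. The sketch shows only that the verifier's reward depends on whether its final verdict matches the known correctness of the generator's answer, with no per-step labels involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verification:
    """Hypothetical container for one verifier output."""
    step_judgments: List[bool]  # process-level verdicts; never compared to labels
    final_judgment: bool        # verifier's overall verdict on the solution


def verifier_outcome_reward(v: Verification, answer_is_correct: bool) -> float:
    """Assumed 0/1 reward: 1.0 when the verifier's final verdict agrees with
    the ground-truth correctness of the generator's final answer."""
    return 1.0 if v.final_judgment == answer_is_correct else 0.0


def interleaved_schedule(total_iters: int, gen_steps: int, ver_steps: int) -> List[str]:
    """Return which component is updated at each iteration, alternating
    blocks of generator updates and verifier updates (block sizes assumed)."""
    if gen_steps < 0 or ver_steps < 0 or gen_steps + ver_steps == 0:
        raise ValueError("block sizes must be non-negative and not both zero")
    schedule: List[str] = []
    while len(schedule) < total_iters:
        schedule.extend(["generator"] * gen_steps)
        schedule.extend(["verifier"] * ver_steps)
    return schedule[:total_iters]


if __name__ == "__main__":
    v = Verification(step_judgments=[True, False], final_judgment=False)
    print(verifier_outcome_reward(v, answer_is_correct=False))
    print(interleaved_schedule(total_iters=5, gen_steps=2, ver_steps=1))
```

Note that `step_judgments` plays no part in the reward. That mirrors the abstract's claim that the verifier is trained without explicit process-level annotations, although how the step-level verdicts are then used to guide the generator is not described in the abstract and is left out here.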
