RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Abstract

Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
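
The two mechanisms named in the abstract, an outcome-level verification correctness reward for the verifier and interleaved updates of generator and verifier, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names, the binary 1.0/0.0 reward values, and the fixed number of update steps per round are all hypothetical.

```python
from typing import List


def verifier_outcome_reward(verdict_says_correct: bool, solution_is_correct: bool) -> float:
    """Outcome-level reward for the verifier (assumed binary for illustration).

    The verifier is rewarded only for whether its final verdict on a solution
    matches the ground-truth correctness of that solution; no step-level
    (process-level) labels are consulted.
    """
    return 1.0 if verdict_says_correct == solution_is_correct else 0.0


def interleaved_schedule(rounds: int, generator_steps: int, verifier_steps: int) -> List[str]:
    """Order of updates when the two models are trained in alternation.

    Each round performs `generator_steps` generator updates followed by
    `verifier_steps` verifier updates. The step counts are hypothetical knobs.
    """
    schedule: List[str] = []
    for _ in range(rounds):
        schedule.extend(["generator"] * generator_steps)
        schedule.extend(["verifier"] * verifier_steps)
    return schedule


if __name__ == "__main__":
    # The verifier judged a wrong solution as correct, so it earns no reward.
    print(verifier_outcome_reward(verdict_says_correct=True, solution_is_correct=False))
    # Two rounds, each with three generator updates and one verifier update.
    print(interleaved_schedule(rounds=2, generator_steps=3, verifier_steps=1))
```

In this sketch the verifier's reward depends only on the final verdict, which is one way to read "without requiring explicit process-level annotations"; how the actual framework shapes rewards or balances update steps is not specified in the abstract.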
