Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

Abstract

Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.
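The abstract does not say how verifier scores are combined with parallel samples. As an illustration only, the sketch below assumes one common option: each sampled solution yields a final answer and a verifier score in [0, 1], and the answer with the largest summed score is selected (weighted voting). The function names `weighted_vote` and `majority_vote` and the example scores are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_vote(answers, scores):
    """Pick the answer whose samples have the largest total verifier score.

    answers: final answers extracted from N parallel samples (hypothetical input).
    scores:  one verifier score per sample, e.g. an estimated probability
             that the sampled solution is correct (an assumption here).
    """
    if not answers:
        raise ValueError("at least one sample is required")
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")

    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score  # accumulate verifier mass per distinct answer
    return max(totals, key=totals.get)


def majority_vote(answers):
    """Verifier-free baseline: every sample counts equally."""
    return weighted_vote(answers, [1.0] * len(answers))


if __name__ == "__main__":
    # Five parallel samples: the verifier trusts the two "12" solutions
    # more than the three "15" solutions (made-up numbers).
    answers = ["12", "15", "12", "15", "15"]
    scores = [0.9, 0.2, 0.8, 0.1, 0.3]
    print(majority_vote(answers))          # plain vote follows the count
    print(weighted_vote(answers, scores))  # weighted vote follows the scores
```

In this toy example the unweighted vote selects "15" because it appears most often, while the weighted vote selects "12" because its samples carry more verifier mass. This is the kind of gain a verifier can add on top of parallel sampling.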
