Expanding RL with Verifiable Rewards Across Diverse Domains

Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promising results in mathematical reasoning and coding tasks where well-structured reference answers are available. However, its applicability to broader domains remains underexplored. In this work, we study the extension of RLVR to more diverse domains such as medicine, chemistry, psychology, and economics. We observe high agreement in binary judgments across different large language models (LLMs) when objective reference answers exist, which challenges the necessity of large-scale annotation for training domain-specific reward models. To address the limitations of binary rewards when handling unstructured reference answers, we further incorporate model-based soft scoring into RLVR to improve its flexibility. Our experiments show that a distilled generative reward model can serve as an effective cross-domain verifier, providing reliable reward signals for RL without requiring domain-specific annotations. By fine-tuning a base 7B model using various RL algorithms against our reward model, we obtain policies that outperform state-of-the-art open-source aligned LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large margin across domains in free-form answer settings. This also strengthens RLVR's robustness and scalability, highlighting its potential for real-world applications with noisy or weak labels.
