Reinforcing General Reasoning without Verifiers

Abstract

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier, but this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of keeping the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and an implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
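To make the central idea concrete, the following is a minimal sketch of a reward based on the probability of the reference answer rather than on a verifier's verdict. It assumes the policy exposes per-token log-probabilities for the reference answer, conditioned on the question and a sampled reasoning trace. The function name and the numeric values are hypothetical and purely illustrative; this is not the authors' implementation, which is available in the repository linked above.

```python
import math


def reference_answer_reward(ref_token_logprobs):
    """Probability of the whole reference answer under the policy.

    Assumption: `ref_token_logprobs` holds the log-probability the policy
    assigns to each reference-answer token, given the question and one
    sampled reasoning trace. The sequence probability is exp(sum of logs).
    """
    return math.exp(sum(ref_token_logprobs))


# Hypothetical log-probs for the same reference answer after two different
# sampled reasoning traces (illustrative numbers only).
trace_a = [-0.1, -0.2, -0.05]  # reasoning that makes the answer likely
trace_b = [-1.5, -2.0, -0.7]   # reasoning that makes the answer unlikely

reward_a = reference_answer_reward(trace_a)
reward_b = reference_answer_reward(trace_b)

# A reasoning trace that leads the model towards the reference answer
# receives the larger reward, with no external verifier involved.
print(reward_a > reward_b)
```

Under this sketch, the reward is a continuous value in (0, 1] instead of a binary correct/incorrect signal, which is one plausible reading of "directly maximize the probability of generating the reference answer".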
