Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning

Abstract

Trustworthy verifiers are essential to the success of reinforcement learning with verifiable reward (RLVR), the core methodology behind large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, previous work has widely adopted rule-based verifiers to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Next, we investigate model-based verifiers as a potential solution to these limitations. While static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL training results indicate that they are highly susceptible to hacking: they misclassify certain patterns in responses as correct (i.e., false positives). This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the distinct risks inherent to both rule-based and model-based verifiers, and we hope they offer useful insights for developing more robust reward systems in reinforcement learning.
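To make the false negative failure mode concrete, the following minimal sketch contrasts a strict string-matching check with a check that compares answers as exact rationals. It is an illustration only and is not taken from the paper: the function names, the toy answer pairs, and the matching rules are all assumptions, and real open-source verifiers are considerably more elaborate.

```python
from fractions import Fraction


def strict_verifier(prediction: str, reference: str) -> bool:
    """Hypothetical rule-based check: exact match after trimming whitespace."""
    return prediction.strip() == reference.strip()


def numeric_verifier(prediction: str, reference: str) -> bool:
    """Hypothetical equivalence check: compare both answers as exact rationals."""
    try:
        return Fraction(prediction.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # Fall back to the strict rule when an answer is not a plain number.
        return strict_verifier(prediction, reference)


def false_negative_rate(pairs, verifier) -> float:
    """Share of truly equivalent (prediction, reference) pairs that are rejected."""
    rejected = sum(1 for pred, ref in pairs if not verifier(pred, ref))
    return rejected / len(pairs)


# Illustrative pairs (assumed, not from any dataset); every pair is equivalent.
EQUIVALENT_PAIRS = [("0.5", "1/2"), ("2", "2.0"), ("3/6", "1/2"), ("7", "7")]

if __name__ == "__main__":
    print("strict :", false_negative_rate(EQUIVALENT_PAIRS, strict_verifier))
    print("numeric:", false_negative_rate(EQUIVALENT_PAIRS, numeric_verifier))
```

On these toy pairs the strict rule rejects three of the four equivalent answers, while the rational comparison accepts all of them. The numeric check still falls back to string matching for symbolic answers, so it would reject "1+x" against "x+1"; this is the kind of format sensitivity the abstract attributes to rule-based verifiers.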
