
Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning

May 28, 2025
Authors: Yuzhen Huang, Weihao Zeng, Xingshan Zeng, Qi Zhu, Junxian He
cs.AI

Abstract

Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL training results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct (i.e., false positives). This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique risks inherent to both rule-based and model-based verifiers, aiming to offer valuable insights to develop more robust reward systems in reinforcement learning.
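To make the false-negative failure mode concrete, below is a minimal Python sketch, not taken from the paper: the function names and the sympy-based equivalence check are illustrative assumptions. An exact-match rule rejects the prediction "0.5" against the reference "1/2" even though the two are mathematically identical, whereas a symbolic check accepts the pair. Model-based verifiers avoid this brittleness but, as the abstract notes, introduce false positives that the policy can exploit for inflated rewards.

```python
# Minimal illustrative sketch (not the paper's verifier): an exact-match
# rule-based check yields a false negative on equivalent answers written in
# different formats, while a symbolic check accepts them.
# Assumes sympy is installed; the function names here are hypothetical.
from sympy import simplify, sympify


def rule_based_verify(prediction: str, reference: str) -> bool:
    """Exact string match after trivial normalization."""
    return prediction.strip().lower() == reference.strip().lower()


def symbolic_verify(prediction: str, reference: str) -> bool:
    """Parse both answers as math expressions and test symbolic equivalence."""
    try:
        return simplify(sympify(prediction) - sympify(reference)) == 0
    except Exception:
        # Fall back to string matching when the answers are not parseable math.
        return rule_based_verify(prediction, reference)


# "0.5" and "1/2" denote the same number, but the rule-based check rejects
# the pair -- a false negative that would zero out the RLVR reward signal.
print(rule_based_verify("0.5", "1/2"))  # False
print(symbolic_verify("0.5", "1/2"))    # True
```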
