TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
May 20, 2025
Authors: Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
cs.AI
Abstract
Reinforcement Learning (RL) has become a powerful tool for enhancing the
reasoning abilities of large language models (LLMs) by optimizing their
policies with reward signals. Yet, RL's success relies on the reliability of
rewards, which are provided by verifiers. In this paper, we expose and analyze
a widespread problem--false negatives--where verifiers wrongly reject correct
model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals
that over 38% of model-generated responses suffer from false negatives, where
the verifier fails to recognize correct answers. We show, both empirically and
theoretically, that these false negatives severely impair RL training by
depriving the model of informative gradient signals and slowing convergence. To
mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments
existing rule-based methods by dynamically identifying potential false
negatives and recovering valid responses to produce more accurate reward
estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts
pass rates by up to 10% and accelerates convergence relative to the baseline.
Our findings highlight the critical importance of addressing verifier false
negatives and offer a practical approach to improve RL-based fine-tuning of
LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
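The hybrid verification idea described in the abstract, a cheap rule-based check backed by an LLM judge that is consulted only on rejections to recover false negatives, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `llm_judge` is a hypothetical stand-in (here a toy equivalence table) for a call to a small verifier model such as TinyV.

```python
def rule_based_verify(prediction: str, gold: str) -> bool:
    """Strict string match after light normalization (case, whitespace)."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) == norm(gold)

def llm_judge(prediction: str, gold: str) -> bool:
    """Hypothetical stub for an LLM-based equivalence judge.

    A real implementation would prompt a small verifier model to decide
    whether the two answers are semantically equivalent; this stub only
    illustrates the control flow with a toy equivalence table.
    """
    equivalent = {("1/2", "0.5"), ("0.5", "1/2")}
    return (prediction.strip(), gold.strip()) in equivalent

def hybrid_reward(prediction: str, gold: str) -> float:
    """Binary RL reward from the hybrid verifier.

    The LLM judge runs only when the rule-based check rejects, so most
    responses incur no extra cost while potential false negatives
    (correct answers in a different surface form) can still be recovered.
    """
    if rule_based_verify(prediction, gold):
        return 1.0
    if llm_judge(prediction, gold):  # recover a potential false negative
        return 1.0
    return 0.0
```

For example, a rule-based verifier alone would score `1/2` against a gold answer of `0.5` as 0.0 (a false negative), whereas the hybrid verifier recovers it and returns 1.0, restoring the informative gradient signal during RL training.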