
TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

May 20, 2025
Authors: Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
cs.AI

Abstract

Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet RL's success relies on the reliability of the rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem: false negatives, where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, in which the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improving RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
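The abstract describes a hybrid verification scheme: a fast rule-based check handles clear matches, and a lightweight LLM verifier is consulted only when the rule-based check rejects a response, so that suspected false negatives can be recovered before the reward is assigned. The sketch below illustrates that control flow under stated assumptions; the function names (`rule_based_verify`, `llm_verify`, `hybrid_reward`) and the `llm` callable are hypothetical stand-ins, not the actual TinyV API, whose prompt and model details live in the linked repository.

```python
import re


def rule_based_verify(model_answer: str, gold_answer: str) -> bool:
    """Strict normalized string match, standing in for a typical rule-based verifier."""
    norm = lambda s: re.sub(r"\s+", "", s).lower()
    return norm(model_answer) == norm(gold_answer)


def llm_verify(question: str, model_answer: str, gold_answer: str, llm) -> bool:
    """Ask a small LLM whether the model answer is equivalent to the gold answer.

    `llm` is assumed to be any callable mapping a prompt string to a
    completion string (e.g. a wrapper around a hosted or local model).
    """
    prompt = (
        f"Question: {question}\n"
        f"Gold answer: {gold_answer}\n"
        f"Model answer: {model_answer}\n"
        "Are the two answers mathematically equivalent? Reply True or False."
    )
    return llm(prompt).strip().lower().startswith("true")


def hybrid_reward(question: str, model_answer: str, gold_answer: str, llm) -> float:
    """Binary RL reward: rule-based fast path, LLM fallback on rejection."""
    if rule_based_verify(model_answer, gold_answer):
        return 1.0  # rule-based verifier already accepts; no LLM call needed
    # A rule-based rejection may be a false negative (e.g. "1/2" vs "0.5"),
    # so defer the final judgment to the LLM verifier.
    return 1.0 if llm_verify(question, model_answer, gold_answer, llm) else 0.0
```

One design point worth noting: because the LLM verifier is only invoked on rule-based rejections, the per-step verification cost stays close to that of the rule-based baseline while still recovering responses that differ from the gold answer only in surface form.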

