TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning

Abstract

Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet RL's success depends on the reliability of those rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem, false negatives, in which verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses, producing more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improving RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
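
To make the idea of augmenting a rule-based verifier concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary reward, a strict string-matching rule, and a fallback verifier that is consulted only when the rule rejects an answer. The names `rule_based_verify`, `reward`, and `toy_fallback` are hypothetical, and `toy_fallback` is a simple numeric comparison standing in for an LLM-based judge.

```python
from fractions import Fraction
from typing import Callable


def rule_based_verify(answer: str, reference: str) -> bool:
    """Strict rule: equality after removing whitespace and lowercasing."""
    def normalize(s: str) -> str:
        return "".join(s.split()).lower()
    return normalize(answer) == normalize(reference)


def toy_fallback(answer: str, reference: str) -> bool:
    """Hypothetical stand-in for an LLM judge: exact numeric equivalence."""
    try:
        return Fraction(answer) == Fraction(reference)
    except (ValueError, ZeroDivisionError):
        # Not parseable as a number, so the stand-in cannot vouch for it.
        return False


def reward(answer: str, reference: str,
           fallback_verify: Callable[[str, str], bool]) -> float:
    """Return 1.0 if either verifier accepts the answer, else 0.0.

    The fallback runs only after a rule-based rejection, so it can
    recover false negatives but never overturns an acceptance.
    """
    if rule_based_verify(answer, reference):
        return 1.0
    return 1.0 if fallback_verify(answer, reference) else 0.0
```

With a reference answer of `1/2`, the strict rule rejects the equivalent response `0.5`; under the assumptions above, the fallback accepts it and the reward becomes 1.0 instead of 0.0. Calling the fallback only on rejections keeps the extra verification cost limited to the cases where a false negative is possible.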
