Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning

October 30, 2025
Authors: Md Tanvirul Alam, Nidhi Rastogi
cs.AI

Abstract

Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.
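Both tasks admit simple exact verifiers, which is what makes them "fully verifiable": the optimal LIS length follows from patience sorting, and the maximum number of compatible activities follows from the earliest-finish-time greedy rule. Below is a minimal sketch of binary verifiable-reward functions in that spirit; the function names, signatures, and the 0/1 reward scheme are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Sketch of verifiable rewards for the two tasks in the abstract.
# Assumption: a binary reward of 1.0 iff the model's proposed solution
# is both valid and optimal; the paper studies several reward designs.
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting, O(n log n))."""
    tails = []
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def lis_reward(seq, proposed):
    """1.0 iff `proposed` is a strictly increasing subsequence of `seq` of optimal length."""
    it = iter(seq)
    if not all(any(x == y for y in it) for x in proposed):  # subsequence check
        return 0.0
    if any(a >= b for a, b in zip(proposed, proposed[1:])):  # strictly increasing
        return 0.0
    return 1.0 if len(proposed) == lis_length(seq) else 0.0


def max_activities(intervals):
    """Maximum number of pairwise non-overlapping (start, end) intervals, greedy by earliest finish."""
    count, last_end = 0, float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:
            count, last_end = count + 1, end
    return count


def scheduling_reward(intervals, proposed):
    """1.0 iff `proposed` is a valid non-overlapping selection from `intervals` of optimal size."""
    chosen = sorted(proposed, key=lambda iv: iv[1])
    if any(iv not in intervals for iv in chosen):
        return 0.0
    if any(b[0] < a[1] for a, b in zip(chosen, chosen[1:])):  # overlap check
        return 0.0
    return 1.0 if len(chosen) == max_activities(intervals) else 0.0


if __name__ == "__main__":
    print(lis_reward([3, 1, 4, 1, 5, 9, 2, 6], [1, 4, 5, 9]))            # 1.0
    print(scheduling_reward([(0, 3), (2, 5), (4, 7)], [(0, 3), (4, 7)]))  # 1.0
```

Because the optimum is computed exactly, such a verifier rewards only correct final answers; as the abstract notes, this still cannot distinguish a genuinely reasoned solution from one produced by a superficial heuristic that happens to reach the optimum, which is why the curated datasets use instances with unique optima.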