Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning

October 30, 2025
Authors: Md Tanvirul Alam, Nidhi Rastogi
cs.AI

Abstract

Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.
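
For concreteness, both tasks admit short polynomial-time reference solvers, which is what makes the reward fully verifiable. The sketch below is a minimal illustration, assuming a binary exact-match reward against the unique ground-truth optimum; the names `lis_length`, `max_nonoverlapping`, and `binary_reward` are hypothetical and not drawn from the paper or its repository.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence
    via patience sorting, O(n log n)."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k+1
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)

def max_nonoverlapping(intervals):
    """Maximum number of mutually non-overlapping activities:
    the classic greedy, selecting in order of finish time."""
    count, last_end = 0, float("-inf")
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if start >= last_end:  # activity fits after the last one selected
            count += 1
            last_end = end
    return count

def binary_reward(predicted, reference):
    """Verifiable reward: 1.0 on exact match with the unique optimum, else 0.0."""
    return 1.0 if predicted == reference else 0.0

# Example: the unique LIS of [1, 5, 2, 3] is [1, 2, 3], so
# binary_reward(3, lis_length([1, 5, 2, 3])) == 1.0
```

Exact match is only one of the "multiple reward designs" the abstract alludes to; it is shown here because unique optima make it the simplest reward that is fully checkable against a reference solver.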