Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
January 28, 2026
Authors: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods, from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, the widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance in which the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose MathForge, a dual-pronged framework that improves mathematical reasoning by targeting harder questions from both perspectives; it comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions through difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while preserving the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO learns effectively from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are available at https://github.com/AMAP-ML/MathForge.
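For context, the sketch below shows standard GRPO group-relative advantage estimation and one hypothetical way a difficulty-aware question-level weight could be layered on top. The abstract does not give DGPO's exact formulation, so `difficulty_weight` and its exponent `gamma` are illustrative assumptions, not the paper's method.

```python
# Minimal sketch, assuming binary (verifiable) rewards per rollout.
# grpo_advantages is the standard GRPO normalization; difficulty_weight is a
# hypothetical stand-in for DGPO's question-level weighting (not from the paper).
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages for one question: normalize each sampled
    response's reward by the group mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def difficulty_weight(rewards, gamma=1.0):
    """Hypothetical question-level weight that grows as the group success
    rate drops, so harder questions contribute more to the policy update."""
    success_rate = float(np.mean(rewards))  # fraction of correct rollouts
    return (1.0 - success_rate) ** gamma

# Example: a hard question where only 2 of 8 rollouts are correct.
rewards = [1, 0, 0, 1, 0, 0, 0, 0]
adv = grpo_advantages(rewards)
w = difficulty_weight(rewards)
weighted_adv = w * adv  # would enter the clipped policy-gradient loss
print(adv.round(3), round(w, 3))
```

With purely group-normalized advantages, every question contributes on the same scale regardless of how few rollouts succeed; a difficulty-dependent weight of this kind is one simple way to tilt updates toward harder questions, which is the direction the abstract describes for DGPO.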