Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
January 28, 2026
Authors: Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu, Zhiwu Lu
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods, from both algorithmic and data perspectives, despite the importance of such questions for refining underdeveloped capabilities. Algorithmically, the widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance in which the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose MathForge, a dual-perspective framework that improves mathematical reasoning by targeting harder questions from both perspectives. It comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions through difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while preserving the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO learns effectively from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are available at https://github.com/AMAP-ML/MathForge.
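
To make the implicit imbalance concrete: with binary verifiable rewards, GRPO normalizes each rollout group's rewards to z-scores, so the per-group advantage magnitude shrinks as the group's success rate moves away from 0.5, and groups that are entirely wrong contribute no gradient at all. The sketch below illustrates this effect and a difficulty-aware question-level reweighting in the spirit of DGPO. It is a minimal sketch under stated assumptions: the z-score estimator, the failure-rate-based weight, and the group size are illustrative choices, not the paper's exact formulation.

import numpy as np

rng = np.random.default_rng(0)

def grpo_advantages(rewards):
    # Standard GRPO: z-score the rewards within one rollout group.
    std = rewards.std()
    if std == 0.0:
        # All-correct or all-wrong group: zero advantages, so very hard
        # questions often contribute no policy-gradient signal at all.
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

def difficulty_weight(rewards):
    # Hypothetical question-level weight that upweights low-success (hard)
    # groups; a stand-in for DGPO's weighting, not its actual form.
    return 1.0 + (1.0 - rewards.mean())

group_size = 16
for p in (0.9, 0.5, 0.1):  # easy -> medium -> hard question
    rewards = (rng.random(group_size) < p).astype(float)  # binary verifiable rewards
    adv = grpo_advantages(rewards)
    # Mean |advantage| is a rough proxy for the group's update magnitude;
    # under z-scoring it peaks near a 0.5 success rate and shrinks toward 0
    # for groups that are almost all wrong (or almost all right).
    print(f"success_rate~{p:.1f}  mean|A|={np.abs(adv).mean():.3f}  "
          f"reweighted={difficulty_weight(rewards) * np.abs(adv).mean():.3f}")

Running this prints a smaller mean |advantage| for the hard (p = 0.1) group than for the medium (p = 0.5) group, which the hypothetical failure-rate weight partially offsets for hard questions while leaving easy ones nearly unchanged.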