Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding
September 8, 2025
Authors: Ziheng Li, Zexu Sun, Jinman Zhao, Erxue Min, Yongcheng Zeng, Hui Wu, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen, Zhi-Hong Deng
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable
success in enhancing the reasoning capabilities of large language models
(LLMs). However, existing RLVR methods often suffer from exploration
inefficiency due to mismatches between the training data's difficulty and the
model's capability. LLMs fail to discover viable reasoning paths when problems
are overly difficult, while learning little new capability when problems are
too simple. In this work, we formalize the impact of problem difficulty by
quantifying the relationship between loss descent speed and rollout accuracy.
Building on this analysis, we propose SEELE, a novel supervision-aided RLVR
framework that dynamically adjusts problem difficulty to stay within the
high-efficiency region. SEELE augments each training sample by appending a hint
(part of a full solution) after the original problem. Unlike previous
hint-based approaches, SEELE deliberately and adaptively adjusts the hint
length for each problem to achieve an optimal difficulty. To determine the
optimal hint length, SEELE employs a multi-round rollout sampling strategy. In
each round, it fits an item response theory model to the accuracy-hint pairs
collected in preceding rounds to predict the required hint length for the next
round. This instance-level, real-time difficulty adjustment aligns problem
difficulty with the evolving model capability, thereby improving exploration
efficiency. Experimental results show that SEELE outperforms Group Relative
Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5
points, respectively, and surpasses the best previous supervision-aided
approach by +3.6 points on average across six math reasoning benchmarks.
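
The abstract does not specify how the item response theory (IRT) model is parameterized or inverted. As a rough, hypothetical illustration of the fit-and-invert step, the sketch below assumes a two-parameter logistic (2PL) item-response curve mapping hint ratio (the fraction of the reference solution revealed) to rollout accuracy, fits it to the accuracy-hint pairs gathered in earlier rounds, and inverts it for a target accuracy. The function names, the warm-up probes, and the target accuracy of 0.5 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import curve_fit


def irt_accuracy(hint_ratio, discrimination, difficulty):
    """Assumed 2PL item-response curve: rollout accuracy as a function of hint ratio."""
    return 1.0 / (1.0 + np.exp(-discrimination * (hint_ratio - difficulty)))


def fit_irt(hint_ratios, accuracies):
    """Fit (discrimination, difficulty) to observed (hint ratio, accuracy) pairs."""
    params, _ = curve_fit(
        irt_accuracy,
        np.asarray(hint_ratios, dtype=float),
        np.asarray(accuracies, dtype=float),
        p0=[5.0, 0.5],
        bounds=([0.1, 0.0], [50.0, 1.0]),
    )
    return params


def next_hint_ratio(hint_ratios, accuracies, target_accuracy=0.5):
    """Invert the fitted curve: hint ratio expected to yield the target rollout accuracy."""
    a, b = fit_irt(hint_ratios, accuracies)
    # sigmoid(a * (h - b)) = target  =>  h = b + logit(target) / a
    h = b + np.log(target_accuracy / (1.0 - target_accuracy)) / a
    return float(np.clip(h, 0.0, 1.0))


def search_hint_length(rollout_accuracy, n_rounds=4, target_accuracy=0.5):
    """Multi-round search: probe two hint ratios, then fit-and-invert each round.

    `rollout_accuracy(h)` is a stand-in for sampling rollouts from the current
    policy with a hint covering the first fraction `h` of the full solution.
    """
    observed = [(h, rollout_accuracy(h)) for h in (0.0, 0.5)]  # warm-up probes (assumed)
    for _ in range(max(0, n_rounds - len(observed))):
        hs, accs = zip(*observed)
        h_next = next_hint_ratio(hs, accs, target_accuracy)
        observed.append((h_next, rollout_accuracy(h_next)))
    return observed[-1][0]
```

Targeting a mid-range accuracy keeps each problem near the high-efficiency region the abstract describes; the paper's actual rollout budget, target accuracy, and IRT parameterization may differ from this sketch.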