
Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

September 8, 2025
作者: Ziheng Li, Zexu Sun, Jinman Zhao, Erxue Min, Yongcheng Zeng, Hui Wu, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen, Zhi-Hong Deng
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. LLMs fail to discover viable reasoning paths when problems are overly difficult, while learning little new capability when problems are too simple. In this work, we formalize the impact of problem difficulty by quantifying the relationship between loss descent speed and rollout accuracy. Building on this analysis, we propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region. SEELE augments each training sample by appending a hint (part of a full solution) after the original problem. Unlike previous hint-based approaches, SEELE deliberately and adaptively adjusts the hint length for each problem to achieve an optimal difficulty. To determine the optimal hint length, SEELE employs a multi-round rollout sampling strategy. In each round, it fits an item response theory model to the accuracy-hint pairs collected in preceding rounds to predict the required hint length for the next round. This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency. Experimental results show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5 points, respectively, and surpasses the best previous supervision-aided approach by +3.6 points on average across six math reasoning benchmarks.
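The abstract does not give implementation details, but the control loop it describes (append a hint of chosen length, observe rollout accuracy, fit an item-response-theory-style curve to the accuracy-hint pairs, predict the hint length for the next round) can be sketched as follows. This is a minimal illustration, not the authors' code: the two-parameter logistic form, the whitespace-token hint split, the 0.5 target accuracy, and names such as `build_hinted_prompt` and `next_hint_ratio` are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def build_hinted_prompt(problem: str, solution: str, hint_ratio: float) -> str:
    """Append the first `hint_ratio` fraction of the reference solution
    (split on whitespace here for simplicity) after the original problem."""
    tokens = solution.split()
    hint = " ".join(tokens[: int(len(tokens) * hint_ratio)])
    return f"{problem}\n{hint}"

def irt_accuracy(hint_ratio, a, b):
    """IRT-style two-parameter logistic: expected accuracy vs. hint length."""
    return 1.0 / (1.0 + np.exp(-a * (hint_ratio - b)))

def next_hint_ratio(observed, target_acc=0.5):
    """Fit the logistic to (hint_ratio, rollout_accuracy) pairs from earlier
    rounds, then invert it to get the ratio predicted to hit `target_acc`."""
    ratios, accs = map(np.asarray, zip(*observed))
    (a, b), _ = curve_fit(irt_accuracy, ratios, accs, p0=[5.0, 0.5], maxfev=10000)
    # Invert target = 1 / (1 + exp(-a * (r - b))) for r.
    ratio = b + np.log(target_acc / (1.0 - target_acc)) / a
    return float(np.clip(ratio, 0.0, 1.0))

# Example: rollout accuracy observed with 0%, 30%, and 60% of the solution shown.
observed = [(0.0, 0.10), (0.3, 0.40), (0.6, 0.80)]
print(next_hint_ratio(observed, target_acc=0.5))
```

In this sketch the predicted ratio is clipped to [0, 1], so a problem the model already solves reliably receives no hint, while a problem it cannot crack even with most of the solution revealed is effectively handed the full solution, which is consistent with the instance-level, real-time difficulty adjustment the abstract describes.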