Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding
September 8, 2025
Authors: Ziheng Li, Zexu Sun, Jinman Zhao, Erxue Min, Yongcheng Zeng, Hui Wu, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen, Zhi-Hong Deng
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable
success in enhancing the reasoning capabilities of large language models
(LLMs). However, existing RLVR methods often suffer from exploration
inefficiency due to mismatches between the training data's difficulty and the
model's capability. LLMs fail to discover viable reasoning paths when problems
are overly difficult, while learning little new capability when problems are
too simple. In this work, we formalize the impact of problem difficulty by
quantifying the relationship between loss descent speed and rollout accuracy.
Building on this analysis, we propose SEELE, a novel supervision-aided RLVR
framework that dynamically adjusts problem difficulty to stay within the
high-efficiency region. SEELE augments each training sample by appending a hint
(part of a full solution) after the original problem. Unlike previous
hint-based approaches, SEELE deliberately and adaptively adjusts the hint
length for each problem to achieve an optimal difficulty. To determine the
optimal hint length, SEELE employs a multi-round rollout sampling strategy. In
each round, it fits an item response theory model to the accuracy-hint pairs
collected in preceding rounds to predict the required hint length for the next
round. This instance-level, real-time difficulty adjustment aligns problem
difficulty with the evolving model capability, thereby improving exploration
efficiency. Experimental results show that SEELE outperforms Group Relative
Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5
points, respectively, and surpasses the best previous supervision-aided
approach by +3.6 points on average across six math reasoning benchmarks.
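
The abstract does not specify how the item response theory (IRT) model is parameterized or inverted. As a rough, hypothetical illustration of the fit-and-invert step, the sketch below assumes a two-parameter logistic (2PL) item-response curve mapping hint ratio (the fraction of the reference solution revealed) to rollout accuracy, fits it to the accuracy-hint pairs gathered in earlier rounds, and inverts it for a target accuracy. The function names, the warm-up probes, and the target accuracy of 0.5 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import curve_fit


def irt_accuracy(hint_ratio, discrimination, difficulty):
    """Assumed 2PL item-response curve: rollout accuracy as a function of hint ratio."""
    return 1.0 / (1.0 + np.exp(-discrimination * (hint_ratio - difficulty)))


def fit_irt(hint_ratios, accuracies):
    """Fit (discrimination, difficulty) to observed (hint ratio, accuracy) pairs."""
    params, _ = curve_fit(
        irt_accuracy,
        np.asarray(hint_ratios, dtype=float),
        np.asarray(accuracies, dtype=float),
        p0=[5.0, 0.5],
        bounds=([0.1, 0.0], [50.0, 1.0]),
    )
    return params


def next_hint_ratio(hint_ratios, accuracies, target_accuracy=0.5):
    """Invert the fitted curve: hint ratio expected to yield the target rollout accuracy."""
    a, b = fit_irt(hint_ratios, accuracies)
    # sigmoid(a * (h - b)) = target  =>  h = b + logit(target) / a
    h = b + np.log(target_accuracy / (1.0 - target_accuracy)) / a
    return float(np.clip(h, 0.0, 1.0))


def search_hint_length(rollout_accuracy, n_rounds=4, target_accuracy=0.5):
    """Multi-round search: probe two hint ratios, then fit-and-invert each round.

    `rollout_accuracy(h)` is a stand-in for sampling rollouts from the current
    policy with a hint covering the first fraction `h` of the full solution.
    """
    observed = [(h, rollout_accuracy(h)) for h in (0.0, 0.5)]  # warm-up probes (assumed)
    for _ in range(max(0, n_rounds - len(observed))):
        hs, accs = zip(*observed)
        h_next = next_hint_ratio(hs, accs, target_accuracy)
        observed.append((h_next, rollout_accuracy(h_next)))
    return observed[-1][0]
```

Targeting a mid-range accuracy keeps each problem near the high-efficiency region the abstract describes; the paper's actual rollout budget, target accuracy, and IRT parameterization may differ from this sketch.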