Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning
January 28, 2026
Authors: Minwu Kim, Safal Shrestha, Keith Ross
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
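To make the core idea concrete, here is a minimal sketch of how failure-prefix conditioning could be wired into an RLVR data pipeline. It is an illustration under assumptions, not the paper's implementation: the helpers `sample_rollouts` and `is_correct`, and parameters such as `prefix_fraction`, are hypothetical names standing in for whatever rollout sampler and verifier a given training setup uses.

```python
# Sketch of failure-prefix conditioning for a saturated problem (assumed API).
import random

def build_failure_prefix_prompts(question, sample_rollouts, is_correct,
                                 n_samples=64, prefix_fraction=0.5,
                                 max_prefixes=4):
    """Turn rare incorrect rollouts on a saturated problem into
    prefix-conditioned training prompts."""
    # Sample standard rollouts; on a saturated problem most will be correct.
    rollouts = sample_rollouts(question, n=n_samples)
    failures = [r for r in rollouts if not is_correct(question, r)]
    if not failures:
        # No informative failures were found; fall back to the original prompt.
        return [question]
    prompts = []
    for traj in random.sample(failures, min(max_prefixes, len(failures))):
        # Truncate the incorrect trajectory to a prefix (whitespace split is a
        # crude stand-in for tokenization).
        tokens = traj.split()
        prefix = " ".join(tokens[: int(len(tokens) * prefix_fraction)])
        # Condition training on the failure-prone state: subsequent RL rollouts
        # start from the question plus the truncated incorrect reasoning.
        prompts.append(question + "\n" + prefix)
    return prompts
```

In this reading, the design choice is simply to move exploration budget from the original question, where correct completions dominate, to states reachable only through the model's own rare mistakes; an iterative variant would periodically re-run this construction with fresh failure prefixes as training progresses.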