

InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

January 20, 2026
Authors: Matthew Y. R. Yang, Hao Bai, Ian Wu, Gene Yang, Amrith Setlur, Aviral Kumar
cs.AI

Abstract

Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error, concatenated with the intervention, localizing the error to the specific step that caused the failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
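To make the data-construction recipe in the abstract concrete, below is a minimal Python sketch of one possible InT-style pipeline. It is an illustration under stated assumptions, not the authors' implementation: the `model` object and its `generate`, `verify`, and `propose_intervention` methods are hypothetical names, and splitting a rollout into steps on blank lines is an arbitrary choice for exposition. The sketch only shows the structure the abstract describes: find the first erroneous step against the reference solution, propose a single-step intervention, and use the truncated prefix concatenated with the intervention as an SFT target.

```python
# Illustrative sketch of InT training-data construction, as described in the abstract.
# All helper names and prompt formats below are assumptions, not the paper's code.

from dataclasses import dataclass


@dataclass
class InTExample:
    prompt: str        # original problem statement
    prefix: str        # on-policy rollout truncated at the first error
    intervention: str  # short, targeted single-step correction


def split_steps(solution: str) -> list[str]:
    """Split a chain-of-thought solution into steps (here: on blank lines)."""
    return [s for s in solution.split("\n\n") if s.strip()]


def first_error_index(model, problem: str, steps: list[str], reference: str):
    """Verify each step against the reference solution and return the index of
    the first incorrect step (None if every step checks out). This relies on the
    assumption that verification is easier than generating a correct solution."""
    for i, step in enumerate(steps):
        verdict = model.verify(problem=problem, reference=reference,
                               prior_steps=steps[:i], step=step)  # hypothetical API
        if verdict == "incorrect":
            return i
    return None


def build_int_example(model, problem: str, reference: str):
    """Construct one InT example from a failed on-policy rollout, or None."""
    rollout = model.generate(problem)  # hypothetical API: on-policy sample
    steps = split_steps(rollout)
    err = first_error_index(model, problem, steps, reference)
    if err is None:
        return None  # rollout already correct; no intervention needed
    prefix = "\n\n".join(steps[:err])
    # Single-step intervention meant to redirect the trajectory toward the reference.
    intervention = model.propose_intervention(  # hypothetical API
        problem=problem, prefix=prefix, reference=reference)
    return InTExample(prompt=problem, prefix=prefix, intervention=intervention)


def to_sft_pair(ex: InTExample) -> tuple[str, str]:
    """SFT target: truncated prefix + intervention, so the loss localizes
    credit at the specific step that caused the failure."""
    return ex.prompt, ex.prefix + "\n\n" + ex.intervention
```

In this reading, the resulting (prompt, prefix + intervention) pairs are used for the SFT stage, and the fine-tuned model then serves as the initialization for the subsequent outcome-reward RL run described above.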