Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
March 19, 2026
Authors: Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang
cs.AI
Abstract
Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. As inference efficiency is prioritized, the distribution gap between the inference policy and the updated policy grows, leading to heavy-tailed importance ratios. These heavy-tailed ratios arise when the policy is locally sharp, which further inflates gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates; the perturbed policy serves as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover the inference policy family with mismatch noise. The flattened distribution thus naturally tightens the gap between the updated and inference policies and reduces the tail of the importance ratios, maintaining training stability. This is further validated empirically: experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance but also avoids blow-ups of the importance-ratio tail and KL spikes during iterative training, while boosting exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
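The ratio construction described above can be illustrated with a minimal sketch: a toy two-layer softmax policy in which a small perturbation is added to each layer's input hidden state, and the perturbed (updated) policy forms the numerator of the importance ratio against the unperturbed inference policy. All names (`forward`, `deltas`, the weight shapes) are hypothetical illustrations, not the authors' implementation, and the perturbations here are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer policy: input hidden state -> logits over a small vocabulary.
W1 = rng.normal(size=(4, 8)) * 0.5
W2 = rng.normal(size=(8, 5)) * 0.5

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(x, deltas=(None, None)):
    """Forward pass; `deltas` holds per-layer perturbations added to each
    layer's *input* hidden state (None means no perturbation)."""
    h = x
    if deltas[0] is not None:
        h = h + deltas[0]          # perturb the input of layer 1
    h = np.tanh(h @ W1)
    if deltas[1] is not None:
        h = h + deltas[1]          # perturb the input of layer 2
    return softmax(h @ W2)

x = rng.normal(size=4)
token = 2

# Inference policy: unchanged forward pass (denominator of the ratio).
p_inf = forward(x)[token]

# Updated policy: same weights plus small layerwise perturbations
# (numerator of the importance ratio in the ALP objective).
deltas = (0.01 * rng.normal(size=4), 0.01 * rng.normal(size=8))
p_upd = forward(x, deltas)[token]

# Small representation-level perturbations keep the ratio near 1,
# which is exactly the tail-control effect the abstract describes.
ratio = p_upd / p_inf
print(ratio)
```

With small perturbation scales the ratio stays close to 1, mirroring the claim that representation-level noise tightens the updated-versus-inference gap; in the actual method the perturbations would be learned jointly with the policy rather than sampled.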