Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Abstract

Off-policy problems, such as policy staleness and training-inference mismatch, have become a major bottleneck for training stability and further exploration in LLM RL. Efforts to improve inference efficiency widen the distribution gap between the inference policy and the updated policy, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates and uses the resulting perturbed policy as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy, and enlarges the policy family to cover the inference policy family with mismatch noise. Hence, the flattened distribution naturally narrows the gap between the updated and inference policies and reduces the tail of the importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance, but also avoids blow-up of the importance-ratio tail and KL spikes during iterative training, while boosting exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
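The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy feed-forward policy, the helper names `policy_log_probs` and `alp_surrogate`, the PPO-style clipping, and the clip value are all assumptions made for illustration. The sketch only shows where the layerwise perturbations enter and how the perturbed policy could serve as the numerator of the importance ratio against an unperturbed inference policy; training the perturbations by gradient descent is omitted.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def policy_log_probs(x, weights, deltas=None):
    """Toy feed-forward policy (a stand-in for an LLM, assumed for illustration).

    If `deltas` is given, deltas[i] is added to the input hidden state of
    layer i before that layer is applied.
    """
    h = x
    for i, w in enumerate(weights):
        if deltas is not None:
            h = h + deltas[i]          # layerwise perturbation of the layer input
        h = h @ w
        if i < len(weights) - 1:
            h = np.tanh(h)
    return log_softmax(h)

def alp_surrogate(x, actions, advantages, weights, deltas, inference_logp, clip=0.2):
    """Clipped surrogate (PPO-style clipping is an assumption, not from the source).

    Numerator of the ratio: perturbed training policy.
    Denominator: log-probs recorded from the unchanged inference policy.
    """
    idx = np.arange(len(actions))
    logp = policy_log_probs(x, weights, deltas)[idx, actions]
    ratio = np.exp(logp - inference_logp)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    objective = np.minimum(ratio * advantages, clipped * advantages).mean()
    return objective, ratio

# Hypothetical toy data: 4 samples, 8 input features, 5 possible actions.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 16)), rng.normal(size=(16, 5))]
x = rng.normal(size=(4, 8))
actions = np.array([0, 1, 2, 3])
advantages = np.array([1.0, -0.5, 0.3, 0.8])

# Here the inference policy is modeled as the same weights without perturbation.
inference_logp = policy_log_probs(x, weights)[np.arange(4), actions]

# Zero perturbations reproduce the inference policy, so every ratio is 1.
zero_deltas = [np.zeros(8), np.zeros(16)]
obj_zero, ratio_zero = alp_surrogate(x, actions, advantages, weights,
                                     zero_deltas, inference_logp)

# Small perturbations (fixed here; in ALP they would be learned).
small_deltas = [0.01 * rng.normal(size=8), 0.01 * rng.normal(size=16)]
obj_small, ratio_small = alp_surrogate(x, actions, advantages, weights,
                                       small_deltas, inference_logp)
```

In this toy setting the inference policy shares the training weights, so the ratio departs from 1 only because of the perturbations. In a real pipeline the inference log-probabilities would come from the rollout engine and could also differ through staleness or numerical mismatch.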
