Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Abstract

Off-policy problems such as policy staleness and training-inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. As inference is made more efficient, the distribution gap between the inference policy and the updated policy grows, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates sharp gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates; the resulting perturbed policy is used as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover the inference policy family with mismatch noise. The flattened distribution therefore naturally tightens the gap between the updated and inference policies and reduces the tail of the importance ratios, maintaining training stability. Empirically, experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance but also avoids blow-up of the importance-ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.
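
To make the role of the perturbation in the importance ratio concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy tanh MLP policy with hand-picked weights, fixed (not learned) perturbation vectors, and hypothetical helper names (`token_logprob`, `importance_ratio`); the abstract does not specify how the perturbations are parameterized, adapted, or optimized. The sketch only shows the structure described above: the perturbed forward pass supplies the numerator, and the unperturbed inference policy supplies the denominator.

```python
import math

def matvec(W, x):
    # Plain matrix-vector product for small nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def log_softmax(z):
    # Numerically stable log-softmax.
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def token_logprob(x, layers, head, token, deltas=None):
    """Log-probability of `token` under a toy tanh MLP policy (assumption).

    If `deltas` is given, deltas[i] is added to the input hidden state of
    layer i before that layer is applied (the layerwise perturbation).
    """
    h = list(x)
    for i, W in enumerate(layers):
        if deltas is not None:
            h = [hj + dj for hj, dj in zip(h, deltas[i])]
        h = [math.tanh(v) for v in matvec(W, h)]
    return log_softmax(matvec(head, h))[token]

def importance_ratio(x, layers, head, token, deltas):
    # Numerator: perturbed policy. Denominator: unchanged inference policy.
    # Here both share the same weights (assumption, for illustration only).
    logp_perturbed = token_logprob(x, layers, head, token, deltas)
    logp_inference = token_logprob(x, layers, head, token)
    return math.exp(logp_perturbed - logp_inference)

# Hypothetical toy values: 2 hidden layers of width 2, vocabulary of 3 tokens.
layers = [[[0.5, -0.3], [0.1, 0.8]], [[1.0, 0.2], [-0.4, 0.6]]]
head = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
x = [0.2, -0.1]
deltas = [[0.01, -0.01], [0.0, 0.02]]  # small per-layer perturbations

ratio = importance_ratio(x, layers, head, token=0, deltas=deltas)
print(ratio)
```

With all perturbations set to zero the ratio is exactly 1, and small perturbations keep it close to 1. In actual training the perturbations would be learned jointly with the policy, and the ratio would enter whatever surrogate objective the method uses, which this sketch does not attempt to reproduce.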
