Rethinking Divergence Regularization in LLM Reinforcement Learning

Abstract

Reinforcement learning (RL) has become a key component of post-training for large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, which makes trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the absolute probability shift of the sampled token. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth, advantage-weighted quadratic regularizer on the policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate updates that cross the boundary and provide corrective signals beyond it. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
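The contrast drawn in the abstract among ratio clipping, a hard divergence mask, and a smooth quadratic regularizer can be made concrete with a small numerical sketch. Everything below is an illustrative assumption rather than the paper's formulation: the function names, the thresholds `eps` and `delta`, and in particular the linear weight `1 - push / delta`, which is simply what one obtains by differentiating a hypothetical per-token penalty proportional to `|A| * (p_new - p_old)**2 / (2 * delta)`. The abstract does not state the exact DPPO mask or DRPO regularizer.

```python
# Illustrative sketch only. The weighting rules below are assumptions made for
# this example; they are not the definitions used by PPO/GRPO, DPPO, or DRPO
# implementations, which the abstract does not spell out.

def ppo_clip_weight(p_new, p_old, advantage, eps=0.2):
    """Per-token gradient weight under PPO-style ratio clipping (1 or 0)."""
    ratio = p_new / p_old
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0

def signed_push(p_new, p_old, advantage):
    """Absolute probability shift, signed so that positive means the policy
    has already moved in the direction the advantage pushes it."""
    sign = 1.0 if advantage >= 0 else -1.0
    return sign * (p_new - p_old)

def hard_mask_weight(p_new, p_old, advantage, delta=0.05):
    """Hypothetical hard divergence mask: drop the gradient once the signed
    shift exceeds delta."""
    return 0.0 if signed_push(p_new, p_old, advantage) > delta else 1.0

def smooth_weight(p_new, p_old, advantage, delta=0.05):
    """Hypothetical smooth alternative induced by a quadratic penalty on the
    shift: 1 at zero shift, 0 at the boundary, negative (corrective) beyond.
    Since the shift lies in [-1, 1], the weight lies in [1 - 1/delta, 1 + 1/delta]."""
    return 1.0 - signed_push(p_new, p_old, advantage) / delta

# Two tokens with a positive advantage: a rare one and a common one.
cases = {
    "rare token":   (1e-4, 1e-5),   # ratio about 10, shift about 9e-5
    "common token": (0.58, 0.50),   # ratio 1.16, shift about 0.08
}
for name, (p_new, p_old) in cases.items():
    print(
        name,
        "ratio=%.2f" % (p_new / p_old),
        "shift=%.5f" % (p_new - p_old),
        "clip=%.1f" % ppo_clip_weight(p_new, p_old, 1.0),
        "mask=%.1f" % hard_mask_weight(p_new, p_old, 1.0),
        "smooth=%.2f" % smooth_weight(p_new, p_old, 1.0),
    )
```

Under these assumed thresholds, ratio clipping zeroes the rare token even though its probability barely moved, while it leaves the common token untouched despite a much larger absolute shift. The hard mask reverses that verdict but gives the common token no gradient at all, whereas the smooth weight passes through zero at the same boundary and turns negative beyond it, pulling the policy back.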