VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Abstract

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and under fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO.
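To make the terminology concrete, the following minimal sketch computes a raw sequence-level importance weight and a length-normalized variant from per-token log-probabilities. It illustrates only the quantities the abstract names; it is not VESPO's reshaping kernel, whose form the abstract does not give. The function names and the sample numbers are hypothetical, chosen for illustration.

```python
import math


def sequence_log_weight(logp_current, logp_behavior):
    """Log of the sequence-level importance weight.

    Both arguments are per-token log-probabilities of the same sampled
    sequence, under the current policy and the behavior policy.
    The sequence-level ratio is the product of token ratios, so its log
    is the sum of per-token log-ratios.
    """
    if len(logp_current) != len(logp_behavior):
        raise ValueError("log-probability lists must have equal length")
    return sum(c - b for c, b in zip(logp_current, logp_behavior))


def sequence_weight(logp_current, logp_behavior):
    """Raw sequence-level importance weight (no length normalization)."""
    return math.exp(sequence_log_weight(logp_current, logp_behavior))


def length_normalized_weight(logp_current, logp_behavior):
    """Geometric mean of the token ratios, one common normalization."""
    n = len(logp_current)
    return math.exp(sequence_log_weight(logp_current, logp_behavior) / n)


# Hypothetical log-probabilities for a three-token response.
logp_current = [-1.0, -0.5, -2.0]
logp_behavior = [-1.2, -0.7, -1.9]

raw = sequence_weight(logp_current, logp_behavior)            # exp(0.3)
normed = length_normalized_weight(logp_current, logp_behavior)  # exp(0.1)

# A small, constant per-token drift compounds over a long response:
# 200 tokens with a log-ratio of 0.05 each give a raw weight of exp(10).
long_raw = sequence_weight([-1.0] * 200, [-1.05] * 200)
```

The last line shows why raw sequence-level weights are prone to high variance: the weight grows exponentially with the accumulated per-token log-ratio, so long responses sampled from a slightly stale policy can dominate a batch.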
