

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

February 11, 2026
Authors: Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu
cs.AI

Abstract

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO
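The abstract's key mechanism can be illustrated concretely: a sequence-level importance weight is the ratio of current-policy to behavior-policy sequence probabilities, computed by summing per-token log-probability differences without dividing by sequence length, and VESPO then reshapes that raw weight to reduce variance. The sketch below is a minimal illustration of this pipeline; the `soft_reshape` function is a generic smooth soft clip used as a hypothetical stand-in, not VESPO's actual closed-form kernel (which the abstract does not specify).

```python
import math

def sequence_importance_weight(logp_current, logp_behavior):
    """Sequence-level importance weight pi_current(y|x) / pi_behavior(y|x),
    computed in log space by summing per-token log-prob differences.
    Note: no length normalization, matching the abstract's description."""
    log_ratio = sum(c - b for c, b in zip(logp_current, logp_behavior))
    return math.exp(log_ratio)

def soft_reshape(w, tau=2.0):
    """Hypothetical variance-reducing reshaping of a raw importance weight:
    a smooth soft clip that is near-identity for small weights and saturates
    at tau for extreme weights. This is an illustrative stand-in, NOT the
    closed-form kernel derived in the paper."""
    return tau * math.tanh(w / tau)

# Identical policies give a weight of exactly 1 (log-ratio is 0).
w_match = sequence_importance_weight([-1.0, -2.0], [-1.0, -2.0])

# A stale behavior policy can produce an extreme raw weight,
# which the reshaping bounds to keep gradient variance in check.
w_stale = sequence_importance_weight([-1.0, -2.0], [-4.0, -5.0])
w_safe = soft_reshape(w_stale)
```

The reason a smooth kernel is preferable to a hard clip (as in token-level PPO-style clipping) is that it keeps the objective differentiable everywhere while still bounding the contribution of any single off-policy sequence.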