

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

February 11, 2026
Authors: Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu
cs.AI

Abstract

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO.
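To make the variance problem concrete, the sketch below (an illustration with simulated numbers, not the paper's reshaping kernel) computes sequence-level importance weights as the product of per-token probability ratios and shows how their spread grows with sequence length; it also includes a PPO-style token-level clip, one of the remedies the abstract mentions. All function names and parameter values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sequence_importance_weights(logp_current, logp_behavior):
    # Sequence-level importance weight = product of per-token ratios,
    # computed in log space for numerical stability.
    return np.exp((logp_current - logp_behavior).sum(axis=-1))

def token_clipped_ratios(logp_current, logp_behavior, eps=0.2):
    # PPO-style token-level clipping: bounds each per-token ratio,
    # trading bias for variance (the abstract notes such fixes lack a
    # unified theoretical foundation).
    ratios = np.exp(logp_current - logp_behavior)
    return np.clip(ratios, 1.0 - eps, 1.0 + eps)

# Simulate a small per-token log-prob mismatch between behavior and
# current policy (hypothetical scale 0.05) and watch the spread of the
# sequence-level weight blow up as sequences get longer.
stds = {}
for seq_len in (8, 64, 512):
    diffs = rng.normal(loc=0.0, scale=0.05, size=(4096, seq_len))
    w = np.exp(diffs.sum(axis=-1))
    stds[seq_len] = w.std()
    print(f"len={seq_len:4d}  weight std={stds[seq_len]:.3f}")
```

Even a tiny per-token mismatch compounds multiplicatively over a long sequence, which is why uncorrected sequence-level importance weights are high-variance and why reshaping them (as VESPO does in closed form) is attractive.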
February 24, 2026