VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Abstract

Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO
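To make the variance problem concrete, the following is a minimal sketch of the quantities the abstract contrasts: the raw sequence-level importance weight, a length-normalized variant, and token-level clipping. It is not VESPO's reshaping kernel, whose closed form is not given in this abstract. All function names, the `eps` parameter, and the example log-probabilities are hypothetical and chosen only for illustration.

```python
import math


def sequence_importance_weight(logp_current, logp_behavior):
    """Raw sequence-level weight: product over tokens of pi(y_t) / mu(y_t).

    Inputs are per-token log-probabilities under the current policy (pi)
    and the behavior policy (mu). Hypothetical helper for illustration.
    """
    if len(logp_current) != len(logp_behavior):
        raise ValueError("log-prob lists must have equal length")
    log_ratio = sum(c - b for c, b in zip(logp_current, logp_behavior))
    return math.exp(log_ratio)


def length_normalized_weight(logp_current, logp_behavior):
    """Geometric mean of the per-token ratios (sequence-level normalization)."""
    if len(logp_current) != len(logp_behavior):
        raise ValueError("log-prob lists must have equal length")
    if not logp_current:
        raise ValueError("sequence must contain at least one token")
    log_ratio = sum(c - b for c, b in zip(logp_current, logp_behavior))
    return math.exp(log_ratio / len(logp_current))


def clipped_token_ratios(logp_current, logp_behavior, eps=0.2):
    """Token-level clipping: each per-token ratio is bounded to [1 - eps, 1 + eps]."""
    if len(logp_current) != len(logp_behavior):
        raise ValueError("log-prob lists must have equal length")
    return [
        min(max(math.exp(c - b), 1.0 - eps), 1.0 + eps)
        for c, b in zip(logp_current, logp_behavior)
    ]


# Assumed toy data: a 100-token response whose per-token log-ratio is 0.05.
logp_pi = [-1.00] * 100
logp_mu = [-1.05] * 100

raw_weight = sequence_importance_weight(logp_pi, logp_mu)
normalized_weight = length_normalized_weight(logp_pi, logp_mu)
clipped = clipped_token_ratios([0.0, -1.0], [-1.0, -1.0])
```

In this toy case a per-token drift of about 5% compounds over 100 tokens into a raw sequence weight near exp(5), roughly 148, while the length-normalized weight stays near 1.05. That gap is one way to see why unnormalized sequence-level weights have high variance on long responses, and why the existing remedies change the weight in different ways.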
