KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Abstract

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods rely predominantly on noise-based exploration and SDE-based surrogate policies. These are mismatched with the deterministic ODE dynamics of distilled AR models, and they tend to perturb low-level appearance rather than the high-level semantic storyline progression that is critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators show consistent gains in visual quality, motion quality, and text-video alignment in both single-prompt short-video and multi-prompt long-video settings.
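The abstract does not spell out how rewards are turned into per-branch weights. As a minimal sketch, assuming the standard GRPO convention of normalizing each reward by the mean and standard deviation of its group, the group-relative advantage for a set of generation branches sharing one prompt could be computed as follows. The function name, the `eps` constant, and the reward values are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center and scale the rewards of one group of branches.

    Assumption: this follows the commonly used GRPO normalization
    (reward minus group mean, divided by group standard deviation).
    KVPO's exact weighting may differ.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    # Population variance over the group.
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four branches generated from the same prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Branches that score above the group mean receive positive advantages and those below receive negative ones, which is what lets a reward-weighted contrastive objective push the policy toward the better branches without a separate value network.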