KVPO: ODE-Native GRPO with KV Semantic Exploration for Autoregressive Video Alignment

Abstract

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
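As a rough illustration of the two ingredients named above, the following Python sketch pairs a stochastic selection of historical KV-cache slots with the group-relative reward normalization commonly used in GRPO. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the routing rule, so `route_kv_entries`, `keep_ratio`, and the placeholder rewards are hypothetical, and the TVE surrogate policy itself is not modeled here.

```python
import random
import statistics


def route_kv_entries(num_entries, keep_ratio, rng):
    """Hypothetical routing rule (an assumption, not from the paper):
    keep a random subset of historical KV-cache slots, in temporal order."""
    keep = max(1, round(num_entries * keep_ratio))
    return sorted(rng.sample(range(num_entries), keep))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)

# Each branch conditions on a different random subset of 8 cached slots.
branches = [route_kv_entries(num_entries=8, keep_ratio=0.5, rng=rng) for _ in range(4)]

# Placeholder reward-model scores, one per branch (illustrative values only).
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for slots, adv in zip(branches, advantages):
    print(slots, round(adv, 3))
```

In this sketch, branches that score above the group mean receive positive advantages and the rest receive negative ones; a reward-weighted contrastive objective of the kind the abstract describes would then weight each branch by such a signal.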