KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Abstract

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
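The exploration and weighting steps described above can be illustrated with a minimal sketch. The abstract does not specify the routing rule or the TVE formula, so everything here is an assumption for illustration only: `route_kv_entries` is a hypothetical helper that keeps a uniformly random subset of historical KV indices per branch, and `group_relative_advantages` applies the standard GRPO normalization (reward minus group mean, divided by group standard deviation). The reward values are placeholders, not results.

```python
import random
import statistics


def route_kv_entries(num_entries, keep_ratio, num_branches, seed=0):
    """Sample one subset of historical KV indices per branch.

    Hypothetical routing rule (not taken from the paper): uniform sampling
    without replacement. Indices are sorted so causal order is preserved.
    """
    rng = random.Random(seed)
    keep = max(1, round(keep_ratio * num_entries))
    return [
        sorted(rng.sample(range(num_entries), keep))
        for _ in range(num_branches)
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO normalization within one group of branches."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four branches, each attending to half of an 8-entry KV history.
branches = route_kv_entries(num_entries=8, keep_ratio=0.5, num_branches=4)

# Placeholder scores standing in for a reward model, one per branch.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

In a full pipeline, each branch would presumably be generated by attending only to its routed KV entries, scored by a reward model, and the resulting advantages would weight the TVE-based contrastive objective; those stages are not sketched here.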