KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Abstract

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
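The group-relative part of GRPO can be illustrated with a minimal sketch. It assumes the commonly used GRPO normalization, in which each branch's reward is standardized against the other branches generated for the same prompt. The abstract does not state which normalization KVPO uses or how the advantages are combined with TVE, so the function name, the `eps` term, and the reward values below are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of branches: (r - mean) / (std + eps).

    Assumption: this is the usual GRPO-style normalization, not a detail
    confirmed by the abstract.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four generation branches of the same prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Branches that score above the group mean receive positive advantages and those below receive negative ones, which is what allows a reward-weighted contrastive objective to be built without a separate value model.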