VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Abstract

Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or by applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding, incurring substantial compute overhead and failing to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling direct decoding of scene geometry from the latent space. By constructing the LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitations of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometry consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement an efficient and flexible approach to world-consistent video generation.
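
Two ingredients named in the abstract, a smoothness reward over a camera trajectory and the group-relative normalization used in Group Relative Policy Optimization, can be illustrated with a minimal sketch. The reward below (negative mean squared second difference of camera centers) is an assumption chosen for illustration and is not the formula from the paper; the function names `smoothness_reward` and `group_relative_advantages` are hypothetical. In the method described above, camera poses would be decoded from diffusion latents by the LGM, whereas here the camera positions are simply given as input. The advantage computation follows the commonly used GRPO form, in which each reward is standardized within its sampling group.

```python
import math
from typing import List, Sequence

Vec3 = Sequence[float]


def smoothness_reward(positions: Sequence[Vec3]) -> float:
    """Hypothetical smoothness reward (an assumption, not the paper's formula).

    Returns the negative mean squared second-order finite difference of the
    camera centers, so jittery trajectories receive lower rewards.
    """
    if len(positions) < 3:
        return 0.0  # too few frames to estimate acceleration
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # The second difference approximates per-frame camera acceleration.
        accel = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]
        total += sum(a * a for a in accel)
    return -total / (len(positions) - 2)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one sampling group (GRPO-style advantages)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative trajectories: constant-velocity motion versus zigzag motion.
    smooth = [(float(t), 0.0, 0.0) for t in range(5)]
    jittery = [(float(t), float((-1) ** t), 0.0) for t in range(5)]
    rewards = [smoothness_reward(smooth), smoothness_reward(jittery)]
    print(rewards, group_relative_advantages(rewards))
```

Because the advantages are computed relative to the group mean, only the ranking and spread of rewards within a group of samples matter, not their absolute scale.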
