VGGRPO: Towards World-Consistent Video Generation with 4D Latent Reward

Abstract

Large-scale video diffusion models achieve impressive visual quality, yet often fail to preserve geometric consistency. Prior approaches improve consistency either by augmenting the generator with additional modules or by applying geometry-aware alignment. However, architectural modifications can compromise the generalization of internet-scale pretrained models, while existing alignment methods are limited to static scenes and rely on RGB-space rewards that require repeated VAE decoding; they therefore incur substantial compute overhead and fail to generalize to highly dynamic real-world scenes. To preserve the pretrained capacity while improving geometric consistency, we propose VGGRPO (Visual Geometry GRPO), a latent geometry-guided framework for geometry-aware video post-training. VGGRPO introduces a Latent Geometry Model (LGM) that stitches video diffusion latents to geometry foundation models, enabling scene geometry to be decoded directly from the latent space. By constructing the LGM from a geometry model with 4D reconstruction capability, VGGRPO naturally extends to dynamic scenes, overcoming the static-scene limitation of prior methods. Building on this, we perform latent-space Group Relative Policy Optimization with two complementary rewards: a camera motion smoothness reward that penalizes jittery trajectories, and a geometry reprojection consistency reward that enforces cross-view geometric coherence. Experiments on both static and dynamic benchmarks show that VGGRPO improves camera stability, geometric consistency, and overall quality while eliminating costly VAE decoding, making latent-space geometry-guided reinforcement learning an efficient and flexible approach to world-consistent video generation.
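To make the reward-and-advantage idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the reward formulas, so `smoothness_reward` is a hypothetical stand-in that penalizes the squared discrete acceleration of camera positions, and `group_relative_advantages` uses the group normalization commonly associated with Group Relative Policy Optimization (reward minus group mean, divided by group standard deviation). Both function names and the toy trajectories are assumptions for illustration; in VGGRPO the camera poses would come from the Latent Geometry Model rather than being given directly.

```python
from statistics import mean, pstdev


def smoothness_reward(positions):
    """Hypothetical smoothness reward: negative mean squared second
    difference (discrete acceleration) of a camera position sequence."""
    if len(positions) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # Discrete acceleration per axis: p[t+1] - 2 * p[t] + p[t-1]
        total += sum((n - 2 * c + p) ** 2 for p, c, n in zip(prev, cur, nxt))
    return -total / (len(positions) - 2)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).
    Population standard deviation is an assumption made for this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of two sampled videos, each summarized by camera positions (x, y, z).
smooth = [(0.0, 0.0, float(t)) for t in range(5)]  # constant velocity
jittery = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.5),
           (0.0, 0.0, 2.0), (0.0, 0.0, 1.5)]

rewards = [smoothness_reward(smooth), smoothness_reward(jittery)]
advantages = group_relative_advantages(rewards)
```

In this toy group the constant-velocity trajectory receives a positive advantage and the jittery one a negative advantage, which is the signal a policy update would use to favor stable camera motion. A reprojection consistency term could be added to each reward before normalization in the same way.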
