Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation

Abstract

Real-world applications such as video games and virtual reality often require the ability to model 3D scenes that users can explore along custom camera trajectories. While significant progress has been made in generating 3D objects from text or images, creating long-range, 3D-consistent, explorable 3D scenes remains a complex and challenging problem. In this work, we present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image and a user-defined camera path. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames, eliminating the need for 3D reconstruction pipelines (e.g., structure-from-motion or multi-view stereo). Our method integrates three key components: 1) World-Consistent Video Diffusion: a unified architecture that jointly generates aligned RGB and depth video sequences, conditioned on existing world observations to ensure global coherence; 2) Long-Range World Exploration: an efficient world cache with point culling, combined with auto-regressive inference and smooth video sampling, for iterative scene extension with context-aware consistency; and 3) Scalable Data Engine: a video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos, enabling large-scale, diverse training data curation without manual 3D annotations. Collectively, these designs yield a clear improvement over existing methods in visual quality and geometric accuracy, and support versatile applications.
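To make the link between aligned RGB-D frames, point clouds, and a culled world cache concrete, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera model, a 3x4 camera-to-world pose matrix, and a simple voxel-occupancy rule for culling; the abstract does not specify how Voyager culls points, so the function names (`unproject_depth`, `update_cache`) and the `voxel` parameter are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


def unproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Back-project a metric depth map into world-space points (pinhole model)."""
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip pixels with no valid depth
                continue
            # Pixel (u, v) at depth z -> camera-space coordinates.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Apply the 3x4 [R | t] camera-to-world transform.
            wx, wy, wz = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in cam_to_world)
            points.append((wx, wy, wz))
    return points


def voxel_key(p: Point, voxel: float) -> VoxelKey:
    """Index of the voxel cell containing point p."""
    return (math.floor(p[0] / voxel), math.floor(p[1] / voxel), math.floor(p[2] / voxel))


def update_cache(cache: Dict[VoxelKey, Point], points: List[Point], voxel: float = 0.05) -> int:
    """Add points to the cache, culling any that fall in an occupied voxel.

    Assumed culling rule for illustration only. Returns the number of points added.
    """
    added = 0
    for p in points:
        key = voxel_key(p, voxel)
        if key not in cache:
            cache[key] = p
            added += 1
    return added


if __name__ == "__main__":
    identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
    depth = [[1.0, 1.0], [0.0, 2.0]]  # one invalid pixel
    pts = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0, cam_to_world=identity)
    cache: Dict[VoxelKey, Point] = {}
    print(update_cache(cache, pts, voxel=0.5), len(cache))
```

In an auto-regressive loop of the kind the abstract describes, each newly generated RGB-D clip would be unprojected with its camera poses and merged into the cache this way, so that redundant points from overlapping views do not accumulate as the scene is extended.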
