Lyra 2.0: Explorable Generative 3D Worlds

Abstract

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D with feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting in which current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when those regions are revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing (retrieving relevant past frames and establishing dense correspondences with the target viewpoints), while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer, 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
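The abstract does not specify how past frames are retrieved. As a purely illustrative sketch, geometry-based routing could be imagined as scoring each past frame by how much of its stored 3D geometry falls inside the target view, then keeping the best-scoring frames. Everything below is an assumption made for illustration and not the authors' method: the function names `visible_fraction` and `retrieve_frames`, the visible-fraction score, the pinhole camera model, and the top-k selection.

```python
import numpy as np

def visible_fraction(points_world, K, R, t, width, height):
    """Fraction of a frame's 3D points that project inside the target image.

    points_world: (N, 3) points in world coordinates (hypothetical per-frame geometry).
    K: (3, 3) pinhole intrinsics; R, t: world-to-camera rotation and translation.
    """
    if len(points_world) == 0:
        return 0.0
    cam = points_world @ R.T + t              # world -> target camera coordinates
    z = cam[:, 2]
    in_front = z > 1e-6                       # keep only points in front of the camera
    z_safe = np.where(in_front, z, 1.0)       # avoid dividing by zero or negative depth
    uvw = cam @ K.T                           # apply intrinsics
    u = uvw[:, 0] / z_safe
    v = uvw[:, 1] / z_safe
    inside = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return float(inside.mean())

def retrieve_frames(frame_points, K, R, t, width, height, k=2):
    """Return indices of the k past frames most visible from the target view, plus all scores."""
    scores = [visible_fraction(p, K, R, t, width, height) for p in frame_points]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores
```

In this sketch the geometry only decides which past frames to condition on; it contributes nothing to pixel appearance, which mirrors the abstract's statement that geometry is used solely for information routing.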
