
Lyra 2.0: Explorable Generative 3D Worlds

April 14, 2026
Authors: Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, Xuanchi Ren
cs.AI

Abstract

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
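The "information routing" idea in the abstract can be pictured with a small geometric sketch: per-frame geometry is used only to decide which past frames are relevant to a target viewpoint and how their pixels correspond to it, while appearance itself is left to the generative model. The snippet below is a minimal illustration, assuming each generated frame carries an estimated point cloud, intrinsics, and camera pose; the function names, the overlap score, and the top-4 retrieval budget are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of geometry-based information routing. Per-frame 3D geometry
# is used only to (1) rank past frames by how much of their content re-projects into
# the target view and (2) map their points to target pixels; appearance synthesis is
# left to the video model. All names and thresholds here are illustrative assumptions.
import numpy as np

def project(points_world, K, w2c, hw):
    """Project world-space points into a camera; return pixel coords and a visibility mask."""
    h, w = hw
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]                      # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)    # perspective divide
    visible = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, visible

def retrieve_history(target_K, target_w2c, hw, history, k=4):
    """Rank past frames by the fraction of their geometry visible from the target viewpoint."""
    scores = []
    for frame in history:                               # each frame: {'points', 'K', 'w2c', 'image'}
        _, visible = project(frame["points"], target_K, target_w2c, hw)
        scores.append(visible.mean())                   # how much of this old frame is seen again
    order = np.argsort(scores)[::-1][:k]
    return [history[i] for i in order]

def dense_correspondences(frame, target_K, target_w2c, hw):
    """Map a retrieved frame's 3D points to their pixel locations in the target view."""
    uv, visible = project(frame["points"], target_K, target_w2c, hw)
    return uv[visible], visible                         # routes past observations, does not render them
```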
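The self-augmented history idea can likewise be sketched as a training-loop modification: with some probability, the clean ground-truth context frames are replaced by the model's own autoregressive rollout before the loss is computed, so the model sees drifted inputs and learns to correct them. The `generate_next` method, the call signature of `model`, and the MSE stand-in objective below are assumptions for illustration only, not the released Lyra 2.0 code.

```python
# Hypothetical sketch of training with self-augmented histories: sometimes condition
# on the model's own (drifted) rollout instead of clean context, so it learns to
# correct accumulated errors rather than propagate them. Model interface is assumed.
import torch

def self_augment_history(model, clean_frames, cameras, p_self=0.5):
    """With probability p_self, swap the clean history for the model's own rollout."""
    if torch.rand(()) > p_self:
        return clean_frames                              # standard teacher-forced history
    with torch.no_grad():
        rollout = [clean_frames[:, :1]]                  # seed with the first clean frame
        for t in range(1, clean_frames.shape[1]):
            prev = torch.cat(rollout, dim=1)
            rollout.append(model.generate_next(prev, cameras[:, : t + 1]))  # assumed API
        return torch.cat(rollout, dim=1)                 # accumulated drift lives here

def training_step(model, optimizer, batch):
    history = self_augment_history(model, batch["history_frames"], batch["history_cams"])
    pred = model(history, batch["history_cams"], batch["target_cams"])      # assumed signature
    loss = torch.nn.functional.mse_loss(pred, batch["target_frames"])       # stand-in objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design point is the mismatch being closed: at inference time the model only ever sees its own outputs as context, so exposing it to that drifted context during training turns error accumulation into a corrective signal.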