Lyra 2.0: Explorable Generative 3D Worlds
April 14, 2026
Authors: Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, Xuanchi Ren
cs.AI
Abstract
Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
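The abstract describes using per-frame 3D geometry purely for information routing: past frames are retrieved based on geometric overlap with the target viewpoint and serve as conditioning context, while appearance is still synthesized by the generative prior. The sketch below illustrates one plausible form of such geometry-based retrieval; it is not the paper's implementation, and the data layout (`history` as a list of dicts with `depth`, `K`, `c2w`), the function names, and the `top_k` parameter are all assumptions made for illustration.

```python
import numpy as np

def unproject_depth(depth, K, c2w):
    """Lift a depth map to a world-space point cloud using intrinsics K and camera-to-world pose c2w."""
    H, W = depth.shape
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)   # camera-space points
    pts_h = np.concatenate([rays, np.ones((len(rays), 1))], axis=1)
    return (c2w @ pts_h.T).T[:, :3]                              # world-space points

def project_points(points_world, K, w2c, H, W):
    """Project world points into a camera; return the fraction that lands inside the image."""
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (w2c @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 1e-6
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return float((in_front & inside).mean())

def retrieve_context_frames(history, target_K, target_w2c, H, W, top_k=4):
    """Rank past frames by how much of their observed geometry reprojects into the
    target view, and return the indices of the top_k most relevant frames."""
    scores = []
    for frame in history:
        pts = unproject_depth(frame["depth"], frame["K"], frame["c2w"])
        scores.append(project_points(pts, target_K, target_w2c, H, W))
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:top_k]]
```

Under this reading, frames whose geometry overlaps the target viewpoint (e.g. a region being revisited after a long detour) are pulled back into the conditioning context even when they have fallen outside the model's temporal window, which is how the method counters spatial forgetting; the second mechanism, training on self-augmented histories, instead targets temporal drifting by exposing the model to its own degraded rollouts during training.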