Lyra 2.0: Explorable Generative 3D Worlds

Abstract

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, so the model must hallucinate structure when those regions are revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer, 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
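
The abstract does not say how relevant past frames are retrieved, so the following is only an illustrative sketch of one plausible form of geometry-based routing, not the Lyra 2.0 implementation. It assumes each past frame stores a small set of world-space 3D points, that the target view is a pinhole camera, and that relevance is the fraction of a frame's points that project inside the target image. All names (`project`, `visible_fraction`, `retrieve_frames`) and the scoring rule are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Matrix = Sequence[Sequence[float]]  # 3x3 world-to-camera rotation


def project(point: Point, R: Matrix, t: Sequence[float],
            fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Project a world-space point to pixels; None if it lies behind the camera."""
    # Camera-space coordinates: Xc = R @ X + t
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 1e-6:
        return None
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def visible_fraction(points: Sequence[Point], R: Matrix, t: Sequence[float],
                     intrinsics: Tuple[float, float, float, float],
                     size: Tuple[int, int]) -> float:
    """Fraction of a frame's stored points that land inside the target image."""
    if not points:
        return 0.0
    fx, fy, cx, cy = intrinsics
    width, height = size
    hits = 0
    for p in points:
        uv = project(p, R, t, fx, fy, cx, cy)
        if uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height:
            hits += 1
    return hits / len(points)


def retrieve_frames(history: Sequence[Sequence[Point]], R: Matrix, t: Sequence[float],
                    intrinsics: Tuple[float, float, float, float],
                    size: Tuple[int, int], k: int = 2) -> List[int]:
    """Return indices of up to k past frames with the highest non-zero visibility score."""
    scored = [(visible_fraction(pts, R, t, intrinsics, size), idx)
              for idx, pts in enumerate(history)]
    # Highest score first; ties broken by the older (smaller) frame index.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for score, idx in scored[:k] if score > 0]


# Usage: three past frames, target camera at the origin looking down +z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
history = [
    [(0.0, 0.0, 5.0), (0.1, 0.1, 5.0)],   # fully in view
    [(0.0, 0.0, -5.0)],                   # behind the camera
    [(0.0, 0.0, 2.0), (10.0, 0.0, 2.0)],  # half in view
]
selected = retrieve_frames(history, identity, [0.0, 0.0, 0.0],
                           (100.0, 100.0, 50.0, 50.0), (100, 100), k=2)
```

In this sketch the geometry only decides which past frames are passed to the generator. It never contributes pixel values, which matches the abstract's separation between information routing and appearance synthesis.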
