SimRecon: SimReady Compositional Scene Reconstruction from Real Videos

Abstract

Compositional scene reconstruction seeks to create object-centric representations, rather than holistic scenes, from real-world videos, which makes the result natively applicable to simulation and interaction. Conventional compositional reconstruction approaches primarily emphasize visual appearance and generalize poorly to real-world scenarios. In this paper, we propose SimRecon, a framework that realizes a "Perception-Generation-Simulation" pipeline for cluttered scene reconstruction: it first conducts scene-level semantic reconstruction from video input, then performs single-object generation, and finally assembles the generated assets in a simulator. However, naively combining these three stages leads to visual infidelity in the generated assets and physical implausibility in the final scene, a problem that is particularly severe for complex scenes. We therefore propose two bridging modules between the three stages. For the transition from Perception to Generation, which is critical for visual fidelity, we introduce Active Viewpoint Optimization, which actively searches 3D space to acquire optimal projected images as conditions for single-object completion. For the transition from Generation to Simulation, which is essential for physical plausibility, we propose a Scene Graph Synthesizer, which guides construction from scratch in 3D simulators, mirroring the native, constructive principle of the real world. Extensive experiments on the ScanNet dataset validate our method's superior performance over previous state-of-the-art approaches.
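To make the idea of searching 3D space for a good projection viewpoint concrete, the following is a minimal sketch, not the method used in SimRecon. The abstract does not specify the objective or the search procedure, so everything here is an assumption: candidate cameras are sampled on a sphere around the object, and each candidate is scored by a crude visibility proxy (the number of surface points whose normal faces the camera). The names `fibonacci_sphere`, `visibility_score`, and `best_viewpoint` are hypothetical.

```python
import math


def fibonacci_sphere(n, radius, center):
    """Return n roughly evenly spaced candidate camera positions on a sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    positions = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n           # height in (-1, 1)
        r = math.sqrt(max(0.0, 1.0 - z * z))    # ring radius at that height
        theta = golden_angle * i
        positions.append((
            center[0] + radius * r * math.cos(theta),
            center[1] + radius * r * math.sin(theta),
            center[2] + radius * z,
        ))
    return positions


def visibility_score(camera, points, normals):
    """Count points whose normal faces the camera (assumed proxy; ignores occlusion)."""
    score = 0
    for p, n in zip(points, normals):
        to_cam = (camera[0] - p[0], camera[1] - p[1], camera[2] - p[2])
        if to_cam[0] * n[0] + to_cam[1] * n[1] + to_cam[2] * n[2] > 0.0:
            score += 1
    return score


def best_viewpoint(points, normals, n_candidates=64, radius=2.0):
    """Brute-force search: pick the sampled camera with the highest score."""
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(3))
    candidates = fibonacci_sphere(n_candidates, radius, center)
    return max(candidates, key=lambda c: visibility_score(c, points, normals))


# Toy object: a flat patch in the z = 0 plane whose normals point up (+z).
patch_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
patch_normals = [(0.0, 0.0, 1.0)] * 4
camera = best_viewpoint(patch_points, patch_normals)
```

For this toy patch, the selected camera lies above the plane, since cameras below it see only back-facing points. A realistic objective would also have to account for occlusion by neighboring objects in a cluttered scene, which this proxy deliberately ignores.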
