Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate impressive semantic capabilities, they often suffer from spatial blindness: they struggle with fine-grained geometric reasoning and with understanding physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, both of which are limited by data scarcity and poor generalization. In this work, we propose a paradigm shift: leveraging the implicit spatial prior within large-scale video generation models. We posit that, to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features at intermediate noise levels and integrating them with semantic representations through a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks show that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
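To make the idea of token-level gated fusion concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the form of the gate, so this example assumes one scalar sigmoid gate per token, computed from the concatenation of that token's semantic and geometric features. The function name `gated_fusion` and the parameters `weights` and `bias` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, geometric, weights, bias=0.0):
    """Fuse two per-token feature sequences with one scalar gate per token.

    ASSUMPTION: the gate is sigmoid(w . [s; g] + b); the source abstract
    does not state the actual gate design.

    semantic, geometric: lists of tokens, each a list of d floats.
    weights: list of 2*d floats (hypothetical gate parameters).
    Returns (fused_tokens, gates).
    """
    if len(semantic) != len(geometric):
        raise ValueError("both sequences must have the same number of tokens")
    fused, gates = [], []
    for s, g in zip(semantic, geometric):
        if len(s) != len(g) or len(weights) != 2 * len(s):
            raise ValueError("feature dimensions do not match the gate weights")
        # Gate logit from the concatenated token features [s; g].
        logit = bias + sum(w * x for w, x in zip(weights, s + g))
        gate = sigmoid(logit)
        gates.append(gate)
        # Convex combination: gate -> 1 favors geometry, gate -> 0 favors semantics.
        fused.append([gate * gi + (1.0 - gate) * si for si, gi in zip(s, g)])
    return fused, gates


if __name__ == "__main__":
    sem = [[1.0, 2.0], [0.0, 0.0]]
    geo = [[3.0, 4.0], [1.0, 1.0]]
    fused, gates = gated_fusion(sem, geo, weights=[0.0, 0.0, 0.0, 0.0])
    print(fused, gates)
```

With all-zero gate weights and zero bias, every gate equals 0.5 and each fused token is the plain average of its two inputs. A learned gate would instead let each token decide how much geometric signal to admit, which is the behavior that "token-level adaptive" suggests.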
