Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
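
The abstract names a token-level adaptive gated fusion mechanism but gives no formula for it. As a rough illustration only, the sketch below shows one common way such a gate could work: each token gets a scalar gate, computed from the concatenated semantic and generative features, that decides how much of the generative (geometry-bearing) feature to mix in. The function name `gated_fusion`, its parameters, and the gating formula are assumptions made for this example and are not taken from the VEGA-3D code.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real score to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    semantic_tokens: Sequence[Vector],
    generative_tokens: Sequence[Vector],
    gate_weights: Vector,
    gate_bias: float = 0.0,
) -> List[Vector]:
    """Fuse two aligned token sequences with one scalar gate per token.

    Hypothetical formulation (an assumption, not the paper's definition):
        g_i     = sigmoid(w . [s_i ; v_i] + b)
        fused_i = g_i * v_i + (1 - g_i) * s_i
    where s_i is a semantic token and v_i is a generative token.
    """
    if len(semantic_tokens) != len(generative_tokens):
        raise ValueError("token sequences must be aligned one-to-one")

    fused: List[Vector] = []
    for s, v in zip(semantic_tokens, generative_tokens):
        if len(s) != len(v) or len(gate_weights) != len(s) + len(v):
            raise ValueError("feature dimensions do not match gate weights")
        # Score the concatenated features, then squash to a gate in (0, 1).
        concat = list(s) + list(v)
        score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        g = sigmoid(score)
        # Convex combination: g near 1 favors the generative feature.
        fused.append([g * vi + (1.0 - g) * si for si, vi in zip(s, v)])
    return fused
```

In a trained model the gate weights would be learned, so the balance between semantic and generative features could differ from token to token; with all-zero weights and bias the gate is 0.5 and the result is a plain average of the two streams.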

