

Exploring Spatial Intelligence from a Generative Perspective

April 22, 2026
Authors: Muzhi Zhu, Shunyao Jiang, Huanyi Zheng, Zekai Luo, Hao Zhong, Anzhou Li, Kaijun Wang, Jintao Rong, Yang Liu, Hao Chen, Tao Lin, Chunhua Shen
cs.AI

Abstract

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.