Exploring Spatial Intelligence from a Generative Perspective

Abstract

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.
