SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness, with fine-tuned models even surpassing human performance under degraded conditions without any performance drop on clean images. This highlights the promise of degradation-aware training for robust spatial intelligence.