SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
May 21, 2026
Authors: Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.
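To make the notion of a "robustness gap" concrete, the following is a minimal sketch of how per-condition accuracy and the drop relative to clean images might be computed from VQA results. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the `(condition, is_correct)` record format, the condition labels, and the toy records are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Compute accuracy per visual condition.

    `records` is an iterable of (condition, is_correct) pairs; this
    record format is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, is_correct in records:
        totals[condition] += 1
        hits[condition] += int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}


def robustness_gaps(acc, clean_key="clean"):
    """Accuracy drop of each degraded condition relative to clean images."""
    clean = acc[clean_key]
    return {c: clean - a for c, a in acc.items() if c != clean_key}


# Toy, made-up records (not results from the paper).
records = [
    ("clean", True), ("clean", True), ("clean", True), ("clean", False),
    ("motion_blur", True), ("motion_blur", False),
    ("motion_blur", False), ("motion_blur", True),
    ("low_light", True), ("low_light", False),
    ("low_light", False), ("low_light", False),
]

acc = accuracy_by_condition(records)
gaps = robustness_gaps(acc)
for condition, gap in sorted(gaps.items()):
    print(f"{condition}: accuracy={acc[condition]:.2f}, gap={gap:.2f}")
```

A positive gap means the model answers fewer questions correctly under that degradation than on clean images. Under this definition, a fine-tuned model that is robust to degradation would show gaps close to zero while keeping its clean accuracy unchanged.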