SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

Abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that fine-tuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.