

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

October 10, 2025
作者: Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue
cs.AI

Abstract

Amid the current surge of interest in spatial reasoning, researchers have made significant progress in understanding indoor scenes but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotation for dataset curation; and 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. We introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm; to the best of our knowledge, this is the first attempt to broaden the all-scale spatial intelligence of multimodal large language models (MLLMs). Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation, so we build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance and strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.
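To make the "scale as an anchor" idea concrete, here is a minimal, hypothetical sketch of routing a query to a scale-specific expert based on an estimated scene extent. The five scale bands and expert names below are illustrative assumptions for exposition, not the paper's actual implementation, which the abstract does not detail.

```python
import bisect

# Illustrative scale bands (upper bounds in meters) spanning mm to km,
# and hypothetical expert names for each band. These are assumptions,
# not taken from the paper.
SCALE_BOUNDS = [0.1, 1.0, 10.0, 100.0, float("inf")]
EXPERTS = ["object", "tabletop", "indoor", "outdoor", "aerial"]


def route_expert(scene_extent_m: float) -> str:
    """Pick the expert whose scale band covers the estimated scene extent."""
    idx = bisect.bisect_left(SCALE_BOUNDS, scene_extent_m)
    return EXPERTS[min(idx, len(EXPERTS) - 1)]
```

For example, a 5 cm object view would route to the "object" expert, a 5 m room to "indoor", and a 5 km drive to "aerial". A real scale-aware model would presumably learn this routing jointly with the experts rather than hard-coding thresholds.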