
SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

October 10, 2025
作者: Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, Xiangyu Yue
cs.AI

Abstract

With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios by tackling two key challenges: 1) the heavy reliance on indoor 3D scans and labor-intensive manual annotations for dataset curation; 2) the absence of effective all-scale scene modeling, which often leads to overfitting to individual scenes. In this paper, we introduce a holistic solution that integrates a structured spatial reasoning knowledge system, scale-aware modeling, and a progressive training paradigm, as the first attempt, to the best of our knowledge, to broaden the all-scale spatial intelligence of MLLMs. Using a task-specific, specialist-driven automated pipeline, we curate over 38K video scenes across 5 spatial scales to create SpaceVista-1M, a dataset comprising approximately 1M spatial QA pairs spanning 19 diverse task types. While specialist models can inject useful domain knowledge, they are not reliable for evaluation. We therefore build an all-scale benchmark with precise annotations by manually recording, retrieving, and assembling video-based data. However, naive training with SpaceVista-1M often yields suboptimal results due to potential knowledge conflicts. Accordingly, we introduce SpaceVista-7B, a spatial reasoning model that accepts dense inputs beyond semantics and uses scale as an anchor for scale-aware experts and progressive rewards. Finally, extensive evaluations across 5 benchmarks, including our SpaceVista-Bench, demonstrate competitive performance, showcasing strong generalization across all scales and scenarios. Our dataset, model, and benchmark will be released at https://peiwensun2000.github.io/mm2km.