Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence
March 8, 2026
Authors: Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong
cs.AI
Abstract
The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by domain gaps inherent in these narrowly curated datasets.
In this work, we propose Holi-Spatial, the first large-scale, spatially aware multimodal dataset constructed fully automatically from raw video inputs, without human intervention, via our proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps, to object-level and relational semantic annotations, together with corresponding spatial question-answer (QA) pairs.
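The multi-level supervision described above can be sketched as a simple data model. This is a minimal illustration only; every class and field name here is hypothetical and not part of the released pipeline or dataset format:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QAPair:
    # One spatial question-answer pair; category is one of
    # "geometric", "relational", or "semantic" reasoning.
    question: str
    answer: str
    category: str


@dataclass
class SceneRecord:
    # Multi-level supervision attached to one reconstructed scene:
    # geometry (3DGS + rendered depth), semantics (masks, boxes,
    # captions), and spatial QA pairs built on top of them.
    scene_id: str
    gaussians_path: str            # optimized 3DGS reconstruction
    depth_map_paths: List[str]     # depth maps rendered from the 3DGS
    num_masks_2d: int              # 2D instance masks in this scene
    num_bboxes_3d: int             # 3D bounding boxes in this scene
    instance_captions: List[str]   # object-level semantic descriptions
    qa_pairs: List[QAPair] = field(default_factory=list)


def qa_by_category(scene: SceneRecord) -> Dict[str, List[QAPair]]:
    # Group a scene's QA pairs by reasoning category, e.g. to
    # balance geometric vs. relational vs. semantic supervision.
    grouped: Dict[str, List[QAPair]] = {}
    for qa in scene.qa_pairs:
        grouped.setdefault(qa.category, []).append(qa)
    return grouped
```

The point of the sketch is only that each scene bundles all supervision levels together, so a single record can serve reconstruction, grounding, and QA training.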
Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks.
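The reported composition can be tallied with a quick sanity check. The figures below are copied from the text; the assumption that the "4M" in the name refers to the combined annotation count is our reading, not stated in the source:

```python
# Annotation counts reported for Holi-Spatial-4M.
counts = {
    "3DGS scenes":                12_000,
    "2D masks":                1_300_000,
    "3D bounding boxes":         320_000,
    "instance captions":         320_000,
    "3D grounding instances": 1_200_000,
    "spatial QA pairs":       1_200_000,
}

total = sum(counts.values())
print(f"total annotations: {total:,}")  # prints "total annotations: 4,352,000"
```

Summed this way, the annotations come to roughly 4.35M, consistent with the dataset's "4M" scale.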
Holi-Spatial demonstrates strong data curation quality, significantly outperforming existing feed-forward and per-scene optimization methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks with this dataset yields substantial gains in model performance.