

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

March 8, 2026
作者: Yuanyuan Gao, Hao Li, Yifei Liu, Xinhao Ji, Yuning Gong, Yuanjun Liao, Fangfu Liu, Manyuan Zhang, Yuchen Yang, Dan Xu, Xue Yang, Huaxi Huang, Hongjie Zhang, Ziwei Liu, Xiao Sun, Dingwen Zhang, Zhihang Zhong
cs.AI

Abstract

The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a small number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by the domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially aware multimodal dataset, constructed from raw video inputs without human intervention using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial QA pairs. Following this principled, systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. Holi-Spatial demonstrates exceptional data curation quality, significantly outperforming existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks with this dataset yields substantial improvements in model performance.
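The multi-level supervision described in the abstract can be pictured as a per-scene record that bundles the geometry (3DGS reconstruction, depth maps) with the semantic layers (2D masks, 3D boxes, instance captions, QA pairs). The sketch below is a hypothetical schema for illustration only; the field names, file layout, and box parameterization are assumptions, not the dataset's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class SpatialQA:
    """One spatial question-answer pair tied to a scene."""
    question: str
    answer: str
    task: str  # assumed task labels: "geometric", "relational", or "semantic"

@dataclass
class SceneRecord:
    """Hypothetical per-scene container for Holi-Spatial-style annotations."""
    scene_id: str
    gaussians_path: str  # optimized 3DGS reconstruction (e.g. a .ply file)
    depth_maps: list[str] = field(default_factory=list)   # rendered depth, per frame
    masks_2d: list[str] = field(default_factory=list)     # per-frame 2D instance masks
    boxes_3d: list[list[float]] = field(default_factory=list)  # assumed [x, y, z, w, h, d, yaw]
    captions: list[str] = field(default_factory=list)     # one caption per instance
    qa_pairs: list[SpatialQA] = field(default_factory=list)

# Assemble a toy record to show how the layers attach to one scene.
record = SceneRecord(scene_id="scene_0001", gaussians_path="scene_0001.ply")
record.boxes_3d.append([1.2, 0.0, 0.4, 0.5, 0.5, 1.0, 0.0])
record.captions.append("a floor lamp next to the sofa")
record.qa_pairs.append(
    SpatialQA("Which object is to the left of the sofa?", "the lamp", "relational")
)
```

A record like this makes the paper's counts concrete: the 4M figure is the sum over all scenes of the masks, boxes, captions, grounding instances, and QA pairs attached to records of this kind.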