Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Abstract

The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a limited number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by the domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially aware multimodal dataset, constructed from raw video inputs without human intervention using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with corresponding spatial QA pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs that span diverse geometric, relational, and semantic reasoning tasks. In data curation quality, Holi-Spatial significantly outperforms existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks with this dataset leads to substantial improvements in model performance.
