Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Abstract

The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches predominantly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a small number of manually annotated datasets, rather than by systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by the domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially aware multimodal dataset, constructed from raw video inputs without human intervention using our proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps to object-level and relational semantic annotations, together with the corresponding spatial QA pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. In data curation quality, Holi-Spatial significantly outperforms existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks with this dataset substantially improves model performance.
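To make the idea of deriving spatial QA pairs from object-level 3D annotations concrete, the following is a minimal sketch, not the authors' pipeline. The `Box3D` schema, the `distance_qa` helper, the question template, and the metre units are all hypothetical and chosen only for illustration; the abstract does not specify the dataset's actual format or how its QA pairs are generated.

```python
from dataclasses import dataclass
import math


@dataclass
class Box3D:
    """Axis-aligned 3D bounding box (hypothetical schema; units assumed to be metres)."""
    label: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]


def distance_qa(a: Box3D, b: Box3D) -> dict[str, str]:
    """Build one geometric QA pair from two annotated boxes (illustrative template)."""
    # Euclidean distance between the two box centers.
    d = math.dist(a.center, b.center)
    return {
        "question": f"How far apart are the {a.label} and the {b.label}?",
        "answer": f"About {d:.1f} m between their centers.",
    }


# Two made-up objects in one scene; the values are not taken from the dataset.
chair = Box3D("chair", (0.0, 0.0, 0.5), (0.5, 0.5, 1.0))
table = Box3D("table", (3.0, 4.0, 0.5), (1.2, 0.8, 0.8))
qa = distance_qa(chair, table)
print(qa["question"])
print(qa["answer"])
```

A template like this covers only the geometric case; relational or semantic questions would need additional annotations, such as the instance captions, which this sketch does not model.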
