Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Abstract

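The pursuit of spatial intelligence fundamentally relies on access to large-scale, fine-grained 3D data. However, existing approaches mostly construct spatial understanding benchmarks by generating question-answer (QA) pairs from a small number of manually annotated datasets, rather than systematically annotating new large-scale 3D scenes from raw web data. As a result, their scalability is severely constrained, and model performance is further hindered by the domain gaps inherent in these narrowly curated datasets. In this work, we propose Holi-Spatial, the first fully automated, large-scale, spatially aware multimodal dataset, constructed from raw video inputs without human intervention using the proposed data curation pipeline. Holi-Spatial supports multi-level spatial supervision, ranging from geometrically accurate 3D Gaussian Splatting (3DGS) reconstructions with rendered depth maps, to object-level and relational semantic annotations, to the corresponding spatial QA pairs. Following a principled and systematic pipeline, we further construct Holi-Spatial-4M, the first large-scale, high-quality 3D semantic dataset, containing 12K optimized 3DGS scenes, 1.3M 2D masks, 320K 3D bounding boxes, 320K instance captions, 1.2M 3D grounding instances, and 1.2M spatial QA pairs spanning diverse geometric, relational, and semantic reasoning tasks. In data curation quality, Holi-Spatial significantly outperforms existing feed-forward and per-scene optimized methods on datasets such as ScanNet, ScanNet++, and DL3DV. Furthermore, fine-tuning Vision-Language Models (VLMs) on spatial reasoning tasks with this dataset yields substantial improvements in model performance.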

