SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Abstract

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To address this gap, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes and camera movements, together with dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video and process it through a hierarchical filtering pipeline into 2.7 million clips, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.