

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

September 11, 2025
作者: Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao
cs.AI

Abstract

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements, and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video and process it into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
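
The abstract enumerates the per-clip annotation package (per-frame camera poses, depth maps, dynamic masks, a structured caption, and serialized motion instructions). The following is a minimal Python sketch of how such a record might be organized; the field names, array shapes, and loader are illustrative assumptions, not the dataset's actual file layout.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class ClipAnnotation:
    """Hypothetical container for one SpatialVID clip's annotations.

    All field names and shapes are assumptions for illustration; the
    abstract does not specify the released file format.
    """
    clip_id: str
    camera_poses: np.ndarray        # (T, 4, 4) per-frame camera-to-world matrices
    depth_maps: np.ndarray          # (T, H, W) per-frame depth
    dynamic_masks: np.ndarray       # (T, H, W) boolean masks of moving regions
    caption: str                    # structured scene/motion description
    motion_instructions: list[str]  # serialized camera-motion commands


def load_clip(clip_id: str, T: int = 8, H: int = 36, W: int = 64) -> ClipAnnotation:
    """Return a dummy record with the assumed shapes above (placeholder data only)."""
    return ClipAnnotation(
        clip_id=clip_id,
        camera_poses=np.tile(np.eye(4), (T, 1, 1)),
        depth_maps=np.ones((T, H, W), dtype=np.float32),
        dynamic_masks=np.zeros((T, H, W), dtype=bool),
        caption="placeholder structured caption",
        motion_instructions=["move_forward", "pan_left"],
    )


if __name__ == "__main__":
    clip = load_clip("clip_000001")
    print(clip.camera_poses.shape, clip.depth_maps.shape, clip.dynamic_masks.shape)
```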