Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video
July 1, 2025
Authors: Alexander Moore, Amar Saini, Kylie Cancilla, Doug Poland, Carmen Carrano
cs.AI
Abstract
Amodal segmentation and amodal content completion require object priors
to estimate the masks and features of occluded objects in complex scenes. Until
now, no dataset has provided an additional dimension of object context: the
possibility of multiple cameras sharing views of the same scene. We introduce
MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content, the
largest amodal segmentation and first amodal content dataset to date. Cluttered
scenes of generic household objects are simulated in multi-camera video.
MOVi-MC-AC enriches the growing literature on object detection, tracking,
and segmentation with two new contributions to deep learning for computer
vision. Multiple Camera (MC) settings, in which objects can be identified and
tracked across distinct camera perspectives, are rare in both synthetic and
real-world video. We introduce a new complexity to synthetic video by providing
consistent object IDs for detections and segmentations both across frames and
across multiple cameras, each with unique features and motion patterns, within
a single scene. Amodal Content (AC) is a reconstruction task in
which models predict the appearance of target objects through occlusions. In
the amodal segmentation literature, some datasets have been released with
amodal detection, tracking, and segmentation labels. Other methods rely on slow
cut-and-paste schemes to generate amodal content pseudo-labels, but these do
not account for natural occlusions already present in the modal masks. MOVi-MC-AC
provides labels for ~5.8 million object instances, setting a new maximum in the
amodal dataset literature, and it is the first to provide ground-truth amodal
content. The full dataset is available at
https://huggingface.co/datasets/Amar-S/MOVi-MC-AC.
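
Below is a minimal sketch, not from the paper, of fetching the dataset from the Hugging Face Hub and computing a per-instance occlusion rate from a modal/amodal mask pair. The `snapshot_download` call is the standard `huggingface_hub` API; the `occlusion_rate` helper, the boolean mask representation, and the toy masks are illustrative assumptions, since the paper's abstract does not specify the file layout.

```python
# Sketch: download MOVi-MC-AC and measure how occluded an object instance is.
# Assumes masks are available as boolean (H, W) arrays per object instance,
# with the modal mask (visible pixels) a subset of the amodal mask (full extent).
import numpy as np
from huggingface_hub import snapshot_download

# Download the dataset snapshot (the full dataset is large; this grabs everything).
local_dir = snapshot_download(repo_id="Amar-S/MOVi-MC-AC", repo_type="dataset")
print("Dataset downloaded to:", local_dir)

def occlusion_rate(modal_mask: np.ndarray, amodal_mask: np.ndarray) -> float:
    """Fraction of an object's amodal extent hidden by occluders."""
    amodal_area = amodal_mask.sum()
    if amodal_area == 0:
        return 0.0
    return float(1.0 - modal_mask.sum() / amodal_area)

# Toy example: a 4x4 object whose right half sits behind an occluder.
amodal = np.zeros((8, 8), dtype=bool)
amodal[2:6, 2:6] = True          # full (amodal) object extent
modal = amodal.copy()
modal[:, 4:] = False             # columns 4+ are occluded, so half is visible
print(occlusion_rate(modal, amodal))  # -> 0.5
```

Because MOVi-MC-AC provides ground-truth amodal content in addition to masks, a statistic like this could be used to stratify evaluation of content-completion models by occlusion severity, though the exact loading code will depend on the dataset's on-disk format.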