Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video

July 1, 2025
Authors: Alexander Moore, Amar Saini, Kylie Cancilla, Doug Poland, Carmen Carrano
cs.AI

Abstract

Amodal segmentation and amodal content completion require using object priors to estimate the occluded masks and features of objects in complex scenes. Until now, no dataset has provided an additional dimension for object context: the possibility of multiple cameras sharing a view of a scene. We introduce MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content, the largest amodal segmentation dataset and the first amodal content dataset to date. Cluttered scenes of generic household objects are simulated in multi-camera video. MOVi-MC-AC extends the growing literature on object detection, tracking, and segmentation with two contributions new to deep learning for computer vision. Multiple Camera (MC) settings, in which objects can be identified and tracked across multiple unique camera perspectives, are rare in both synthetic and real-world video. We introduce a new complexity to synthetic video by providing consistent object IDs for detections and segmentations across both frames and multiple cameras, each with unique features and motion patterns, on a single scene. Amodal Content (AC) is a reconstruction task in which models predict the appearance of target objects through occlusions. In the amodal segmentation literature, some datasets have been released with amodal detection, tracking, and segmentation labels. Other methods rely on slow cut-and-paste schemes to generate amodal content pseudo-labels, but these do not account for the natural occlusions already present in the modal masks. MOVi-MC-AC provides labels for ~5.8 million object instances, the largest count in the amodal dataset literature, and is the first to provide ground-truth amodal content. The full dataset is available at https://huggingface.co/datasets/Amar-S/MOVi-MC-AC.
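
To make the cross-camera ID consistency concrete, here is a minimal sketch of how per-instance annotations with shared object IDs could be grouped into per-object tracks spanning cameras and frames. The record fields (camera, frame, object_id, amodal_mask) are hypothetical placeholders for illustration, not the dataset's actual file schema.

```python
from collections import defaultdict

# Hypothetical per-instance annotation records; the field names and file
# names are illustrative only, not taken from the MOVi-MC-AC release.
annotations = [
    {"camera": "cam_0", "frame": 0, "object_id": 17, "amodal_mask": "m_0_0_17.png"},
    {"camera": "cam_1", "frame": 0, "object_id": 17, "amodal_mask": "m_1_0_17.png"},
    {"camera": "cam_0", "frame": 1, "object_id": 17, "amodal_mask": "m_0_1_17.png"},
    {"camera": "cam_1", "frame": 1, "object_id": 42, "amodal_mask": "m_1_1_42.png"},
]

# Because object IDs are consistent across cameras and frames, one pass
# groups every observation of the same physical object: the property that
# enables cross-view tracking and view-invariant representation learning.
tracks = defaultdict(list)
for ann in annotations:
    tracks[ann["object_id"]].append((ann["camera"], ann["frame"], ann["amodal_mask"]))

for obj_id, observations in sorted(tracks.items()):
    print(f"object {obj_id}: seen in {len(observations)} camera-frame pairs")
```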
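
Ground-truth amodal content also makes it possible to score reconstructions on exactly the pixels hidden by occluders, rather than on cut-and-paste pseudo-labels. The sketch below assumes binary NumPy masks and RGB arrays and uses a simple mean squared error; both the array layout and the metric are illustrative assumptions, not the paper's evaluation protocol.

```python
import numpy as np

def occluded_region_error(pred_rgb, gt_amodal_rgb, modal_mask, amodal_mask):
    """Mean squared error restricted to the occluded part of an object.

    The occluded region is where the amodal mask says the object exists
    but the modal mask shows it is hidden behind another object.
    """
    occluded = amodal_mask & ~modal_mask  # pixels hidden by occluders
    if not occluded.any():
        return 0.0                        # object fully visible
    diff = pred_rgb[occluded].astype(np.float64) - gt_amodal_rgb[occluded].astype(np.float64)
    return float(np.mean(diff ** 2))

# Toy example: a 4x4 image containing a 2x3 object with one column occluded.
gt = np.zeros((4, 4, 3), dtype=np.uint8)
pred = np.zeros_like(gt)
amodal = np.zeros((4, 4), dtype=bool)
modal = np.zeros((4, 4), dtype=bool)
amodal[1:3, 0:3] = True   # full (amodal) object extent
modal[1:3, 0:2] = True    # visible part only; column 2 is occluded
gt[1:3, 2] = 200          # ground-truth appearance behind the occluder
pred[1:3, 2] = 180        # model's guess for the hidden pixels
print(occluded_region_error(pred, gt, modal, amodal))  # 400.0
```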
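
For readers who want the data, one plausible way to fetch it is a snapshot download with the huggingface_hub client. The repo_id and repo_type below follow the URL in the abstract; the internal layout of the downloaded files is not described here, so any path handling afterward is left to the reader.

```python
from huggingface_hub import snapshot_download

# Download the full MOVi-MC-AC dataset snapshot from the Hugging Face Hub.
# repo_id and repo_type match the URL given in the abstract; the local
# directory structure after download depends on the upstream file layout.
local_dir = snapshot_download(
    repo_id="Amar-S/MOVi-MC-AC",
    repo_type="dataset",
)
print(f"Dataset files downloaded to: {local_dir}")
```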