

Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video

July 1, 2025
Authors: Alexander Moore, Amar Saini, Kylie Cancilla, Doug Poland, Carmen Carrano
cs.AI

Abstract

Amodal segmentation and amodal content completion require using object priors to estimate occluded masks and features of objects in complex scenes. Until now, no dataset has provided an additional dimension for object context: the possibility of multiple cameras sharing a view of a scene. We introduce MOVi-MC-AC: Multiple Object Video with Multi-Cameras and Amodal Content, the largest amodal segmentation dataset and the first amodal content dataset to date. Cluttered scenes of generic household objects are simulated in multi-camera video. MOVi-MC-AC adds two new contributions to the growing deep-learning literature on object detection, tracking, and segmentation. Multiple Camera (MC) settings, in which objects can be identified and tracked across distinct camera perspectives, are rare in both synthetic and real-world video. We introduce a new complexity to synthetic video by providing consistent object IDs for detections and segmentations both across frames and across multiple cameras, each with unique features and motion patterns, within a single scene. Amodal Content (AC) is a reconstructive task in which models predict the appearance of target objects through occlusions. In the amodal segmentation literature, some datasets have been released with amodal detection, tracking, and segmentation labels, while other methods rely on slow cut-and-paste schemes to generate amodal content pseudo-labels that do not account for natural occlusions present in the modal masks. MOVi-MC-AC provides labels for ~5.8 million object instances, setting a new maximum in the amodal dataset literature, and is the first to provide ground-truth amodal content. The full dataset is available at https://huggingface.co/datasets/Amar-S/MOVi-MC-AC.
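As a minimal sketch of how one might fetch the dataset and use paired modal/amodal masks, the snippet below downloads the repository with `huggingface_hub.snapshot_download` (a real Hugging Face Hub API) and computes a per-instance occlusion rate. The `occlusion_rate` helper, the boolean-mask format, and the toy arrays are illustrative assumptions for exposition, not the published file schema.

```python
# Sketch: fetch MOVi-MC-AC and compute an occlusion rate for one object
# instance. Mask layout here is an assumption, not the dataset's schema.
import numpy as np
from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) the full dataset repository.
# Note: the full dataset is large; this may take a while on first run.
local_dir = snapshot_download(repo_id="Amar-S/MOVi-MC-AC", repo_type="dataset")
print(f"Dataset files available under: {local_dir}")

def occlusion_rate(modal_mask: np.ndarray, amodal_mask: np.ndarray) -> float:
    """Fraction of an object's amodal extent hidden in the modal view.

    Both masks are boolean HxW arrays for the same object instance in the
    same frame; the modal mask is by definition a subset of the amodal mask.
    """
    amodal_area = amodal_mask.sum()
    if amodal_area == 0:
        return 0.0
    hidden = np.logical_and(amodal_mask, ~modal_mask).sum()
    return float(hidden) / float(amodal_area)

# Toy example: a 2x4 object whose right half is occluded.
amodal = np.zeros((4, 4), dtype=bool)
amodal[1:3, :] = True            # full (amodal) extent of the object
modal = amodal.copy()
modal[:, 2:] = False             # right half hidden behind an occluder
print(occlusion_rate(modal, amodal))  # -> 0.5
```

Because the dataset ships ground-truth amodal masks and amodal content rather than cut-and-paste pseudo-labels, statistics like this occlusion rate can be computed exactly per object, per frame, and per camera.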