

S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation

December 16, 2025
Authors: Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski
cs.AI

Abstract

In recent years, the state of the art in unsupervised video instance segmentation has relied heavily on synthetic video data generated from object-centric image datasets such as ImageNet. However, synthesizing videos by artificially shifting and scaling image instance masks fails to accurately model realistic motion, such as perspective changes, movement of parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames; however, these single-frame segmentations exhibit temporal noise, and their quality varies throughout the video. We therefore establish temporal coherence by identifying high-quality keymasks in the video using deep motion priors. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense label set, our approach outperforms the current state of the art across various benchmarks.
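The abstract names deep motion priors as the signal for selecting keymasks but does not spell out the mechanism. Below is a minimal sketch of one plausible realization, assuming the motion prior is an off-the-shelf optical flow estimate and that a frame's mask is scored by its stability under flow warping; the helpers `warp` and `keymask_scores`, the flow format, and the IoU criterion are illustrative assumptions, not the authors' method.

```python
# Hypothetical keymask scoring: rank per-frame masks by how consistent
# they remain under an estimated motion (optical flow) between frames.
import torch
import torch.nn.functional as F

def warp(mask_next, flow):
    """Backward-warp the next frame's mask to the current frame.

    mask_next: (H, W) float binary mask of frame t+1
    flow:      (2, H, W) forward flow t -> t+1 in pixels (dx, dy)
    """
    H, W = mask_next.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # A pixel x in frame t corresponds to x + flow(x) in frame t+1.
    grid = torch.stack(
        (2 * (xs + flow[0]) / (W - 1) - 1,   # x, normalized to [-1, 1]
         2 * (ys + flow[1]) / (H - 1) - 1),  # y, normalized to [-1, 1]
        dim=-1,
    )
    return F.grid_sample(mask_next[None, None], grid[None], align_corners=True)[0, 0]

def keymask_scores(masks, flows):
    """Score frame t by the IoU between its mask and the flow-warped
    mask of frame t+1: temporally stable masks score high and become
    candidates for keymask pseudo-annotations."""
    scores = []
    for t in range(len(masks) - 1):
        warped = (warp(masks[t + 1], flows[t]) > 0.5).float()
        inter = (masks[t] * warped).sum()
        union = ((masks[t] + warped) > 0).float().sum().clamp(min=1.0)
        scores.append((inter / union).item())
    return scores
```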
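The Temporal DropLoss is likewise only named in the abstract. The sketch below assumes, purely for illustration, that it zeroes the per-frame loss on clip frames without a keymask pseudo-label, so gradients flow only from the sparse, high-quality annotations while the model still processes the whole clip and must learn to propagate masks implicitly; the function name and tensor layout are assumptions, not the paper's interface.

```python
# Hypothetical Temporal DropLoss: supervise a clip only on its keymask
# frames; all other frames are ignored in the loss.
import torch
import torch.nn.functional as F

def temporal_drop_loss(logits, pseudo_labels, keyframe_mask, ignore_index=255):
    """
    logits:        (B, T, C, H, W) per-frame segmentation logits
    pseudo_labels: (B, T, H, W) long labels; only trusted on keymask frames
    keyframe_mask: (B, T) bool, True where a frame carries a keymask
    """
    B, T, C, H, W = logits.shape
    # Mark every pixel of a dropped frame as ignored, so frames without
    # a keymask contribute neither loss nor gradient.
    labels = pseudo_labels.clone()
    labels[~keyframe_mask] = ignore_index
    per_pixel = F.cross_entropy(
        logits.reshape(B * T, C, H, W),
        labels.reshape(B * T, H, W),
        ignore_index=ignore_index,
        reduction="none",
    )
    # Average only over the frames that kept their loss.
    denom = keyframe_mask.float().sum().clamp(min=1.0) * H * W
    return per_pixel.sum() / denom

# Example: sparse keymasks on every fourth frame of an 8-frame clip.
logits = torch.randn(2, 8, 5, 32, 32, requires_grad=True)
labels = torch.randint(0, 5, (2, 8, 32, 32))
keyframes = torch.zeros(2, 8, dtype=torch.bool)
keyframes[:, ::4] = True
temporal_drop_loss(logits, labels, keyframes).backward()
```

In the final dense-training stage the abstract describes, `keyframe_mask` would be all-True, and this reduces to a standard per-clip cross-entropy.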