S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation
December 16, 2025
Authors: Leon Sick, Lukas Hoyer, Dominik Engel, Pedro Hermosilla, Timo Ropinski
cs.AI
Abstract
In recent years, the state of the art in unsupervised video instance segmentation has relied heavily on synthetic video data generated from object-centric image datasets such as ImageNet. However, synthesizing videos by artificially shifting and scaling image instance masks fails to accurately model realistic motion, such as perspective changes, movement of parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames; however, these single-frame segmentations exhibit temporal noise, and their quality varies throughout the video. Therefore, we establish temporal coherence by leveraging deep motion priors to identify high-quality keymasks in the video. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense label set, our approach outperforms the current state of the art across various benchmarks.
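To make the sparse-to-dense training idea concrete, the sketch below shows one plausible way to supervise a video segmentation model only on frames that carry a high-quality keymask, while randomly dropping some of those supervised frames so the model must propagate masks to the remaining frames. This is a minimal illustration under assumptions, not the paper's implementation: the function name `sparse_to_dense_loss`, the `drop_prob` parameter, and the interpretation of the Temporal DropLoss as random dropping of supervised frames are all hypothetical.

```python
import torch

def sparse_to_dense_loss(pred_masks, keymask_labels, keyframe_flags, drop_prob=0.5):
    """Toy per-frame loss for sparse keymask supervision (illustrative only).

    pred_masks:     (T, H, W) predicted mask logits, one per frame
    keymask_labels: (T, H, W) pseudo-label masks; only entries where
                    keyframe_flags is True carry reliable keymasks
    keyframe_flags: (T,) bool, True for frames with a high-quality keymask
    drop_prob:      probability of dropping a supervised frame; a stand-in
                    for the paper's Temporal DropLoss (assumed behaviour)
    """
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    total, count = pred_masks.new_zeros(()), 0
    for t in range(pred_masks.shape[0]):
        if not keyframe_flags[t]:
            continue  # no reliable pseudo-label: the model must propagate implicitly
        if torch.rand(()) < drop_prob:
            continue  # randomly drop supervision to further encourage propagation
        total = total + bce(pred_masks[t], keymask_labels[t].float())
        count += 1
    return total / max(count, 1)


# Usage on random data (shapes only; real inputs would come from a video model
# and from the unsupervised keymask selection step)
T, H, W = 8, 64, 64
pred = torch.randn(T, H, W)
labels = (torch.rand(T, H, W) > 0.5)
flags = torch.tensor([True, False, False, True, False, False, True, False])
print(sparse_to_dense_loss(pred, labels, flags))
```

Once such a model has learned to fill in the unsupervised frames, its dense predictions can serve as the "resulting dense label set" mentioned above for training the final model.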