Video Instance Matting
November 7, 2023
Authors: Jiachen Li, Roberto Henschel, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Humphrey Shi
cs.AI
Abstract
Conventional video matting outputs one alpha matte for all instances appearing in a video frame, so that individual instances are not distinguished. While video instance segmentation provides time-consistent instance masks, the results are unsatisfactory for matting applications, especially due to the applied binarization. To remedy this deficiency, we propose Video Instance Matting (VIM), that is, estimating alpha mattes of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performance on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality (VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.
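
To make the VIM task concrete, the sketch below shows a toy per-instance, mask-guided matting interface: T RGB frames plus N binary instance-mask sequences go in, and one soft alpha matte per instance per frame comes out. All names, shapes, and the tiny model stub are illustrative assumptions, not the released MSG-VIM implementation.

```python
import torch
import torch.nn as nn

class ToyMaskGuidedMatting(nn.Module):
    """Minimal stand-in for a mask-guided matting network: refines each
    binary instance mask into a soft alpha matte conditioned on the frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, frames, masks):
        # frames: (T, 3, H, W) RGB clip; masks: (N, T, 1, H, W) binary guidance
        alphas = []
        for n in range(masks.shape[0]):
            # Concatenate each frame with its instance-mask guidance channel.
            x = torch.cat([frames, masks[n]], dim=1)   # (T, 4, H, W)
            alphas.append(self.net(x))                 # (T, 1, H, W) soft alpha
        return torch.stack(alphas)                     # (N, T, 1, H, W)

frames = torch.rand(8, 3, 64, 64)                      # an 8-frame clip
masks = (torch.rand(2, 8, 1, 64, 64) > 0.5).float()    # 2 instances, binary masks
alphas = ToyMaskGuidedMatting()(frames, masks)
print(alphas.shape)                                    # torch.Size([2, 8, 1, 64, 64])
```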
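
The mask augmentations mentioned in the abstract could likewise be sketched, purely as an assumption, by operations such as random dilation/erosion and dropping guidance frames; the exact mixture used in MSG-VIM may differ.

```python
import torch
import torch.nn.functional as F

def augment_mask_sequence(masks: torch.Tensor, p_drop: float = 0.1) -> torch.Tensor:
    """Hypothetical training-time degradation of one instance's guidance masks.
    masks: (T, 1, H, W) binary mask sequence. The operations and parameters
    are illustrative assumptions, not the paper's exact augmentation mixture."""
    out = masks.clone()
    # 1) Random dilation or erosion via max/min pooling (loose or tight masks).
    if torch.rand(1).item() < 0.5:
        out = F.max_pool2d(out, kernel_size=5, stride=1, padding=2)              # dilate
    else:
        out = 1.0 - F.max_pool2d(1.0 - out, kernel_size=5, stride=1, padding=2)  # erode
    # 2) Randomly drop whole frames of guidance (simulates tracking failures).
    keep = (torch.rand(out.shape[0]) > p_drop).float().view(-1, 1, 1, 1)
    return out * keep
```

Degrading the guidance masks during training mimics the errors a video instance segmentation model makes at test time, which is what the abstract refers to as making predictions robust to inaccurate and inconsistent mask guidance.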