Video Instance Matting
November 7, 2023
Authors: Jiachen Li, Roberto Henschel, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Humphrey Shi
cs.AI
Abstract
Conventional video matting outputs one alpha matte for all instances appearing in a video frame, so that individual instances are not distinguished. While video instance segmentation provides time-consistent instance masks, the results are unsatisfactory for matting applications, especially due to the applied binarization. To remedy this deficiency, we propose Video Instance Matting (VIM), that is, estimating alpha mattes of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performance on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality (VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.
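
To make the VIM task concrete, the sketch below shows a toy per-instance, mask-guided matting interface: T RGB frames plus N binary instance-mask sequences go in, and one soft alpha matte per instance per frame comes out. All names, shapes, and the tiny model stub are illustrative assumptions, not the released MSG-VIM implementation.

```python
import torch
import torch.nn as nn

class ToyMaskGuidedMatting(nn.Module):
    """Minimal stand-in for a mask-guided matting network: refines each
    binary instance mask into a soft alpha matte conditioned on the frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, frames, masks):
        # frames: (T, 3, H, W) RGB clip; masks: (N, T, 1, H, W) binary guidance
        alphas = []
        for n in range(masks.shape[0]):
            # Concatenate each frame with its instance-mask guidance channel.
            x = torch.cat([frames, masks[n]], dim=1)   # (T, 4, H, W)
            alphas.append(self.net(x))                 # (T, 1, H, W) soft alpha
        return torch.stack(alphas)                     # (N, T, 1, H, W)

frames = torch.rand(8, 3, 64, 64)                      # an 8-frame clip
masks = (torch.rand(2, 8, 1, 64, 64) > 0.5).float()    # 2 instances, binary masks
alphas = ToyMaskGuidedMatting()(frames, masks)
print(alphas.shape)                                    # torch.Size([2, 8, 1, 64, 64])
```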
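
The mask augmentations mentioned in the abstract could likewise be sketched, purely as an assumption, by operations such as random dilation/erosion and dropping guidance frames; the exact mixture used in MSG-VIM may differ.

```python
import torch
import torch.nn.functional as F

def augment_mask_sequence(masks: torch.Tensor, p_drop: float = 0.1) -> torch.Tensor:
    """Hypothetical training-time degradation of one instance's guidance masks.
    masks: (T, 1, H, W) binary mask sequence. The operations and parameters
    are illustrative assumptions, not the paper's exact augmentation mixture."""
    out = masks.clone()
    # 1) Random dilation or erosion via max/min pooling (loose or tight masks).
    if torch.rand(1).item() < 0.5:
        out = F.max_pool2d(out, kernel_size=5, stride=1, padding=2)              # dilate
    else:
        out = 1.0 - F.max_pool2d(1.0 - out, kernel_size=5, stride=1, padding=2)  # erode
    # 2) Randomly drop whole frames of guidance (simulates tracking failures).
    keep = (torch.rand(out.shape[0]) > p_drop).float().view(-1, 1, 1, 1)
    return out * keep
```

Degrading the guidance masks during training mimics the errors a video instance segmentation model makes at test time, which is what the abstract refers to as making predictions robust to inaccurate and inconsistent mask guidance.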