
Video Instance Matting

November 7, 2023
Authors: Jiachen Li, Roberto Henschel, Vidit Goel, Marianna Ohanyan, Shant Navasardyan, Humphrey Shi
cs.AI

Abstract

Conventional video matting outputs one alpha matte for all instances appearing in a video frame so that individual instances are not distinguished. While video instance segmentation provides time-consistent instance masks, results are unsatisfactory for matting applications, especially due to applied binarization. To remedy this deficiency, we propose Video Instance Matting (VIM), that is, estimating alpha mattes of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performance on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality (VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.
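The abstract describes the mask augmentations only at a high level. As a minimal, hypothetical sketch of the general idea, the Python snippet below perturbs a binary mask sequence with per-frame dilation, erosion, and dropout to mimic the inaccurate and temporally inconsistent guidance a segmenter might produce; the function name and parameters are illustrative assumptions, not the actual MSG-VIM implementation.

```python
import numpy as np
from scipy import ndimage

def augment_mask_sequence(masks, p_morph=0.5, p_drop=0.1, max_iter=5, rng=None):
    """Perturb a (T, H, W) binary mask sequence to mimic noisy guidance.

    Hypothetical sketch: the actual mixture of augmentations used by
    MSG-VIM may differ (see the paper and the open-sourced code).
    """
    rng = rng if rng is not None else np.random.default_rng()
    out = masks.astype(bool)  # astype copies, so the input is untouched
    for t in range(out.shape[0]):
        if rng.random() < p_drop:
            out[t] = False  # drop this frame's mask entirely
            continue
        if rng.random() < p_morph:
            iters = int(rng.integers(1, max_iter + 1))
            if rng.random() < 0.5:
                out[t] = ndimage.binary_dilation(out[t], iterations=iters)
            else:
                out[t] = ndimage.binary_erosion(out[t], iterations=iters)
    return out

# Usage: masks is a (T, H, W) array of instance masks from a video segmenter.
# noisy = augment_mask_sequence(masks)
```

Training against perturbed guidance of this kind, rather than clean ground-truth masks, is the sort of design that could let the network tolerate the imperfect, flickering masks an off-the-shelf video instance segmenter produces at inference time.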