Video Instance Matting

Abstract

Conventional video matting outputs a single alpha matte for all instances appearing in a video frame, so individual instances are not distinguished. Video instance segmentation provides temporally consistent instance masks, but the results are unsatisfactory for matting applications, especially because the masks are binarized. To remedy this deficiency, we propose Video Instance Matting (VIM), that is, estimating the alpha matte of each instance at each frame of a video sequence. To tackle this challenging problem, we present MSG-VIM, a Mask Sequence Guided Video Instance Matting neural network, as a novel baseline model for VIM. MSG-VIM leverages a mixture of mask augmentations to make predictions robust to inaccurate and inconsistent mask guidance. It incorporates temporal mask and temporal feature guidance to improve the temporal consistency of alpha matte predictions. Furthermore, we build a new benchmark for VIM, called VIM50, which comprises 50 video clips with multiple human instances as foreground objects. To evaluate performance on the VIM task, we introduce a suitable metric called Video Instance-aware Matting Quality (VIMQ). Our proposed model MSG-VIM sets a strong baseline on the VIM50 benchmark and outperforms existing methods by a large margin. The project is open-sourced at https://github.com/SHI-Labs/VIM.
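To make the distinction between the three output types concrete, the following is a minimal illustrative sketch, not code from the MSG-VIM repository. It contrasts per-instance alpha mattes (the VIM output) with a single merged matte (conventional matting) and with binarized instance masks (instance segmentation), using the standard compositing equation I = alpha * F + (1 - alpha) * B. The toy values, the 0.5 threshold, the clipped-sum merge, and the names `composite`, `instance_alphas`, `merged_alpha`, and `binary_masks` are all assumptions made for illustration.

```python
import numpy as np

def composite(alpha, fg, bg):
    """Blend a foreground over a background: I = alpha * F + (1 - alpha) * B."""
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return a * fg + (1.0 - a) * bg

# Toy data (made up): 2 instances on a 1x4 frame; pixel 1 is a soft boundary
# shared by both instances.
instance_alphas = np.array([
    [[1.0, 0.6, 0.0, 0.0]],  # instance 0
    [[0.0, 0.3, 1.0, 0.0]],  # instance 1
])

# Conventional matting: one matte for all foreground. Approximated here
# (an assumption) as the clipped sum of the per-instance mattes.
merged_alpha = np.clip(instance_alphas.sum(axis=0), 0.0, 1.0)

# Instance segmentation: per-instance masks, but binarized. The 0.5
# threshold is an assumption for this sketch.
binary_masks = (instance_alphas >= 0.5).astype(np.float64)

# Composite instance 0 (white foreground) onto a black background.
fg = np.ones((1, 4, 3))
bg = np.zeros((1, 4, 3))
soft = composite(instance_alphas[0], fg, bg)  # keeps the soft boundary
hard = composite(binary_masks[0], fg, bg)     # boundary snaps to 0 or 1
```

In this toy case the merged matte can no longer say which instance owns pixel 1, and the binarized mask assigns that pixel entirely to instance 0, discarding the fractional coverage. Per-instance alpha mattes retain both pieces of information.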