Towards Natural Image Matting in the Wild via Real-Scenario Prior

Abstract

Recent approaches attempt to adapt powerful interactive segmentation models, such as SAM, to interactive matting by fine-tuning them on synthetic matting datasets. However, models trained on synthetic data fail to generalize to complex scenes with occlusion. We address this challenge by proposing COCO-Matting, a new matting dataset built on the COCO dataset. Its construction comprises two steps, accessory fusion and mask-to-matte, which select complex real-world images from COCO and convert semantic segmentation masks into matting labels. The resulting COCO-Matting dataset contains 38,251 human instance-level alpha mattes in complex natural scenarios. Furthermore, existing SAM-based matting methods extract intermediate features and masks from a frozen SAM and train only a lightweight matting decoder with end-to-end matting losses, which does not fully exploit the potential of the pre-trained SAM. We therefore propose SEMat, which revamps both the network architecture and the training objectives. For the network architecture, the proposed feature-aligned transformer learns to extract fine-grained edge and transparency features, and the proposed matte-aligned decoder aims to segment matting-specific objects and convert coarse masks into high-precision mattes. For the training objectives, the proposed regularization and trimap losses aim to retain the prior from the pre-trained model and to push the matting logits extracted from the mask decoder to contain trimap-based semantic information. Extensive experiments across seven diverse datasets demonstrate the superior performance of our method and its efficacy in interactive natural image matting. We open-source our code, models, and dataset at https://github.com/XiaRho/SEMat.
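For readers new to matting, the alpha matte and trimap mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the standard compositing model I = alpha * F + (1 - alpha) * B and the common three-class trimap convention (background, unknown, foreground). The function names and the `eps` threshold are hypothetical and are not taken from the SEMat code; the abstract does not specify how the proposed losses or the mask-to-matte step are implemented.

```python
# Illustrative only: generic matting concepts, not code from the SEMat repository.
# Function names and the eps threshold are hypothetical.

def composite(alpha, fg, bg):
    """Blend one pixel value with the matting equation I = alpha*F + (1 - alpha)*B."""
    return alpha * fg + (1.0 - alpha) * bg


def alpha_to_trimap(alpha_matte, eps=1e-6):
    """Label each pixel: 0 = background, 1 = unknown (partially transparent), 2 = foreground."""
    trimap = []
    for row in alpha_matte:
        labels = []
        for a in row:
            if a <= eps:
                labels.append(0)      # fully transparent: background
            elif a >= 1.0 - eps:
                labels.append(2)      # fully opaque: foreground
            else:
                labels.append(1)      # fractional alpha: unknown region
        trimap.append(labels)
    return trimap


# A tiny 2x3 alpha matte: background on the left, a soft edge, foreground on the right.
alpha_matte = [
    [0.0, 0.25, 1.0],
    [0.0, 0.5, 1.0],
]
print(alpha_to_trimap(alpha_matte))    # [[0, 1, 2], [0, 1, 2]]
print(composite(0.25, 200.0, 100.0))   # 125.0
```

The unknown band is where fine edges and transparency live, which is why a trimap is a natural carrier of the coarse semantic information the abstract refers to.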

