Towards Natural Image Matting in the Wild via Real-Scenario Prior
October 9, 2024
Authors: Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Qianru Sun, Yang Tang, Bo Li, Pan Zhou
cs.AI
Abstract
Recent approaches attempt to adapt powerful interactive segmentation models,
such as SAM, to interactive matting and fine-tune the models based on synthetic
matting datasets. However, models trained on synthetic data fail to generalize
to complex and occluded scenes. We address this challenge by proposing a new
matting dataset based on the COCO dataset, namely COCO-Matting. Specifically,
the construction of our COCO-Matting includes accessory fusion and
mask-to-matte, which selects real-world complex images from COCO and converts
semantic segmentation masks to matting labels. The built COCO-Matting comprises
an extensive collection of 38,251 human instance-level alpha mattes in complex
natural scenarios. Furthermore, existing SAM-based matting methods extract
intermediate features and masks from a frozen SAM and only train a lightweight
matting decoder by end-to-end matting losses, which do not fully exploit the
potential of the pre-trained SAM. Thus, we propose SEMat which revamps the
network architecture and training objectives. For network architecture, the
proposed feature-aligned transformer learns to extract fine-grained edge and
transparency features. The proposed matte-aligned decoder aims to segment
matting-specific objects and convert coarse masks into high-precision mattes.
For training objectives, the proposed regularization and trimap loss aim to
retain the prior from the pre-trained model and push the matting logits
extracted from the mask decoder to contain trimap-based semantic information.
Extensive experiments across seven diverse datasets demonstrate the superior
performance of our method, proving its efficacy in interactive natural image
matting. We open-source our code, models, and dataset at
https://github.com/XiaRho/SEMat.
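As a rough illustration (not the authors' released code), the sketch below shows how the three training objectives described in the abstract could be combined: an end-to-end matting loss on the predicted alpha, a regularization loss that keeps the fine-tuned predictions close to the frozen SAM prior, and a trimap loss that pushes the matting logits toward trimap-style semantics. All tensor names, shapes, and loss weights here are hypothetical.

```python
# Minimal sketch of the combined training objective, assuming hypothetical
# tensors: pred_alpha / gt_alpha (alpha mattes), tuned_logits /
# frozen_sam_logits (mask logits before and after fine-tuning), and
# matting_logits / gt_trimap (3-way background / unknown / foreground).
import torch
import torch.nn.functional as F

def matting_loss(pred_alpha, gt_alpha):
    # End-to-end matting loss: here a simple L1 on the alpha matte.
    return F.l1_loss(pred_alpha, gt_alpha)

def regularization_loss(tuned_logits, frozen_sam_logits):
    # Keep fine-tuned mask logits close to the frozen SAM's output
    # so the pre-trained prior is retained.
    return F.mse_loss(tuned_logits, frozen_sam_logits.detach())

def trimap_loss(matting_logits, gt_trimap):
    # Push the matting logits to carry trimap-based semantic information:
    # per-pixel 3-way classification (background / unknown / foreground).
    return F.cross_entropy(matting_logits, gt_trimap)

def total_loss(pred_alpha, gt_alpha,
               tuned_logits, frozen_sam_logits,
               matting_logits, gt_trimap,
               w_reg=1.0, w_tri=1.0):
    # Hypothetical weighting; the paper's actual weights may differ.
    return (matting_loss(pred_alpha, gt_alpha)
            + w_reg * regularization_loss(tuned_logits, frozen_sam_logits)
            + w_tri * trimap_loss(matting_logits, gt_trimap))

# Toy usage with random tensors (batch of 2, 64x64 crops):
if __name__ == "__main__":
    pred_alpha = torch.rand(2, 1, 64, 64)
    gt_alpha = torch.rand(2, 1, 64, 64)
    tuned_logits = torch.randn(2, 1, 64, 64)
    frozen_logits = torch.randn(2, 1, 64, 64)
    matting_logits = torch.randn(2, 3, 64, 64)    # 3 trimap classes
    gt_trimap = torch.randint(0, 3, (2, 64, 64))  # 0=bg, 1=unknown, 2=fg
    print(total_loss(pred_alpha, gt_alpha, tuned_logits, frozen_logits,
                     matting_logits, gt_trimap))
```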