Towards Natural Image Matting in the Wild via Real-Scenario Prior
October 9, 2024
Authors: Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Qianru Sun, Yang Tang, Bo Li, Pan Zhou
cs.AI
Abstract
Recent approaches attempt to adapt powerful interactive segmentation models,
such as SAM, to interactive matting and fine-tune the models based on synthetic
matting datasets. However, models trained on synthetic data fail to generalize
to complex and occluded scenes. We address this challenge by proposing a new
matting dataset based on the COCO dataset, namely COCO-Matting. Specifically,
the construction of our COCO-Matting includes accessory fusion and
mask-to-matte, which select real-world complex images from COCO and convert
their semantic segmentation masks into matting labels. The built COCO-Matting comprises
an extensive collection of 38,251 human instance-level alpha mattes in complex
natural scenarios. Furthermore, existing SAM-based matting methods extract
intermediate features and masks from a frozen SAM and only train a lightweight
matting decoder with end-to-end matting losses, which does not fully exploit the
potential of the pre-trained SAM. Thus, we propose SEMat, which revamps the
network architecture and training objectives. For network architecture, the
proposed feature-aligned transformer learns to extract fine-grained edge and
transparency features. The proposed matte-aligned decoder aims to segment
matting-specific objects and convert coarse masks into high-precision mattes.
For training objectives, the proposed regularization and trimap losses aim to
retain the prior from the pre-trained model and push the matting logits
extracted from the mask decoder to contain trimap-based semantic information.
Extensive experiments across seven diverse datasets demonstrate the superior
performance of our method, proving its efficacy in interactive natural image
matting. We open-source our code, models, and dataset at
https://github.com/XiaRho/SEMat.
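
The mask-to-matte step mentioned in the abstract converts segmentation masks into matting supervision. As background, a common generic pre-processing step in matting pipelines is to derive a trimap (confident foreground, confident background, and an unknown band) from a binary mask via morphological erosion and dilation. The sketch below illustrates only that generic idea, assuming OpenCV-based morphology and an arbitrary kernel size; it is not the actual COCO-Matting construction described in the paper.

```python
import cv2
import numpy as np

def mask_to_trimap(mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Derive a 3-class trimap (0 = background, 128 = unknown, 255 = foreground)
    from a binary segmentation mask.

    Illustrative only: the kernel size and the morphology-based unknown band
    are assumptions, not the procedure used to build COCO-Matting.
    """
    mask = (mask > 0).astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = cv2.erode(mask, kernel)        # confident foreground core
    bg = 1 - cv2.dilate(mask, kernel)   # confident background region
    trimap = np.full(mask.shape, 128, dtype=np.uint8)  # default: unknown band
    trimap[fg == 1] = 255
    trimap[bg == 1] = 0
    return trimap
```

An alpha matte would then be estimated inside the unknown band, which is where the paper's mask-to-matte module goes beyond this simple heuristic.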
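For the training objectives, the following is a minimal, hypothetical PyTorch sketch of how a trimap cross-entropy loss and a prior-preserving regularization term could be combined. The tensor names (`trimap_logits`, `frozen_logits`, `tuned_logits`, `alpha_gt`), the alpha thresholds, and the MSE form of the regularizer are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alpha_to_trimap_labels(alpha: torch.Tensor, lo: float = 0.01, hi: float = 0.99) -> torch.Tensor:
    """Map a ground-truth alpha matte in [0, 1] to 3-class trimap labels
    (0 = background, 1 = unknown, 2 = foreground). Thresholds are illustrative."""
    labels = torch.ones_like(alpha, dtype=torch.long)  # default: unknown
    labels[alpha <= lo] = 0
    labels[alpha >= hi] = 2
    return labels

def trimap_and_regularization_loss(trimap_logits, frozen_logits, tuned_logits,
                                   alpha_gt, reg_weight: float = 1.0):
    """Hypothetical combination of a trimap cross-entropy loss and a
    regularization term that keeps the fine-tuned decoder close to the frozen
    pre-trained SAM output (not the paper's exact objective).

    Shapes assumed:
      trimap_logits: (B, 3, H, W)  -- 3-class trimap prediction
      frozen_logits / tuned_logits: (B, 1, H, W) -- mask logits before / after fine-tuning
      alpha_gt: (B, H, W) in [0, 1]
    """
    trimap_loss = F.cross_entropy(trimap_logits, alpha_to_trimap_labels(alpha_gt))
    reg_loss = F.mse_loss(tuned_logits, frozen_logits.detach())
    return trimap_loss + reg_weight * reg_loss
```

In this sketch, the regularization term is one straightforward way to retain the pre-trained prior, while the trimap term pushes the decoder's logits to carry trimap-style semantics, mirroring the roles the abstract assigns to the two losses.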