Towards Natural Image Matting in the Wild via Real-Scenario Prior
October 9, 2024
Authors: Ruihao Xia, Yu Liang, Peng-Tao Jiang, Hao Zhang, Qianru Sun, Yang Tang, Bo Li, Pan Zhou
cs.AI
Abstract
Recent approaches attempt to adapt powerful interactive segmentation models,
such as SAM, to interactive matting and fine-tune the models based on synthetic
matting datasets. However, models trained on synthetic data fail to generalize
to complex and occluded scenes. We address this challenge by proposing a new
matting dataset based on the COCO dataset, namely COCO-Matting. Specifically,
the construction of our COCO-Matting includes accessory fusion and
mask-to-matte, which select real-world complex images from COCO and convert
their semantic segmentation masks into matting labels. The built COCO-Matting comprises
an extensive collection of 38,251 human instance-level alpha mattes in complex
natural scenarios. Furthermore, existing SAM-based matting methods extract
intermediate features and masks from a frozen SAM and only train a lightweight
matting decoder with end-to-end matting losses, which does not fully exploit the
potential of the pre-trained SAM. Thus, we propose SEMat, which revamps the
network architecture and training objectives. For network architecture, the
proposed feature-aligned transformer learns to extract fine-grained edge and
transparency features. The proposed matte-aligned decoder aims to segment
matting-specific objects and convert coarse masks into high-precision mattes.
For training objectives, the proposed regularization and trimap losses aim to
retain the prior from the pre-trained model and push the matting logits
extracted from the mask decoder to contain trimap-based semantic information.
Extensive experiments across seven diverse datasets demonstrate the superior
performance of our method, proving its efficacy in interactive natural image
matting. We open-source our code, models, and dataset at
https://github.com/XiaRho/SEMat.
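
The mask-to-matte step mentioned in the abstract converts segmentation masks into matting supervision. As background, a common generic pre-processing step in matting pipelines is to derive a trimap (confident foreground, confident background, and an unknown band) from a binary mask via morphological erosion and dilation. The sketch below illustrates only that generic idea, assuming OpenCV-based morphology and an arbitrary kernel size; it is not the actual COCO-Matting construction described in the paper.

```python
import cv2
import numpy as np

def mask_to_trimap(mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Derive a 3-class trimap (0 = background, 128 = unknown, 255 = foreground)
    from a binary segmentation mask.

    Illustrative only: the kernel size and the morphology-based unknown band
    are assumptions, not the procedure used to build COCO-Matting.
    """
    mask = (mask > 0).astype(np.uint8)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = cv2.erode(mask, kernel)        # confident foreground core
    bg = 1 - cv2.dilate(mask, kernel)   # confident background region
    trimap = np.full(mask.shape, 128, dtype=np.uint8)  # default: unknown band
    trimap[fg == 1] = 255
    trimap[bg == 1] = 0
    return trimap
```

An alpha matte would then be estimated inside the unknown band, which is where the paper's mask-to-matte module goes beyond this simple heuristic.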
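For the training objectives, the following is a minimal, hypothetical PyTorch sketch of how a trimap cross-entropy loss and a prior-preserving regularization term could be combined. The tensor names (`trimap_logits`, `frozen_logits`, `tuned_logits`, `alpha_gt`), the alpha thresholds, and the MSE form of the regularizer are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def alpha_to_trimap_labels(alpha: torch.Tensor, lo: float = 0.01, hi: float = 0.99) -> torch.Tensor:
    """Map a ground-truth alpha matte in [0, 1] to 3-class trimap labels
    (0 = background, 1 = unknown, 2 = foreground). Thresholds are illustrative."""
    labels = torch.ones_like(alpha, dtype=torch.long)  # default: unknown
    labels[alpha <= lo] = 0
    labels[alpha >= hi] = 2
    return labels

def trimap_and_regularization_loss(trimap_logits, frozen_logits, tuned_logits,
                                   alpha_gt, reg_weight: float = 1.0):
    """Hypothetical combination of a trimap cross-entropy loss and a
    regularization term that keeps the fine-tuned decoder close to the frozen
    pre-trained SAM output (not the paper's exact objective).

    Shapes assumed:
      trimap_logits: (B, 3, H, W)  -- 3-class trimap prediction
      frozen_logits / tuned_logits: (B, 1, H, W) -- mask logits before / after fine-tuning
      alpha_gt: (B, H, W) in [0, 1]
    """
    trimap_loss = F.cross_entropy(trimap_logits, alpha_to_trimap_labels(alpha_gt))
    reg_loss = F.mse_loss(tuned_logits, frozen_logits.detach())
    return trimap_loss + reg_weight * reg_loss
```

In this sketch, the regularization term is one straightforward way to retain the pre-trained prior, while the trimap term pushes the decoder's logits to carry trimap-style semantics, mirroring the roles the abstract assigns to the two losses.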