Matting Anything
June 8, 2023
Authors: Jiachen Li, Jitesh Jain, Humphrey Shi
cs.AI
Abstract
In this paper, we propose the Matting Anything Model (MAM), an efficient and
versatile framework for estimating the alpha matte of any instance in an image
with flexible and interactive visual or linguistic user prompt guidance. MAM
offers several significant advantages over previous specialized image matting
networks: (i) MAM is capable of dealing with various types of image matting,
including semantic, instance, and referring image matting with only a single
model; (ii) MAM leverages the feature maps from the Segment Anything Model
(SAM) and adopts a lightweight Mask-to-Matte (M2M) module to predict the alpha
matte through iterative refinement, which has only 2.7 million trainable
parameters; and (iii) by incorporating SAM, MAM simplifies the user intervention
required for the interactive use of image matting from the trimap to the box,
point, or text prompt. We evaluate the performance of MAM on various image
matting benchmarks, and the experimental results demonstrate that MAM achieves
comparable performance to the state-of-the-art specialized image matting models
under different metrics on each benchmark. Overall, MAM shows superior
generalization ability and can effectively handle various image matting tasks
with fewer parameters, making it a practical solution for unified image
matting. Our code and models are open-sourced at
https://github.com/SHI-Labs/Matting-Anything.
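The abstract describes a lightweight Mask-to-Matte (M2M) module that iteratively refines a SAM segmentation mask into a soft alpha matte. As a rough intuition only (the real M2M is a learned network with 2.7M trainable parameters operating on SAM feature maps; the box-blur below is a hypothetical stand-in, not the paper's method), iterative mask-to-matte refinement can be sketched as:

```python
def iterative_m2m(mask, n_iters=3):
    """Toy stand-in for M2M: soften a hard 0/1 mask into an alpha matte
    by repeatedly averaging each pixel with its 3x3 neighborhood.
    `mask` is a list of lists of 0.0/1.0 values."""
    h, w = len(mask), len(mask[0])
    alpha = [[float(v) for v in row] for row in mask]
    for _ in range(n_iters):
        refined = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                total, count = 0.0, 0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < h and 0 <= nj < w:
                            total += alpha[ni][nj]
                            count += 1
                refined[i][j] = total / count  # boundary pixels become fractional
        alpha = refined
    return alpha

# A hard square mask gains soft (fractional) edges after refinement.
mask = [[1.0 if 2 <= i < 6 and 2 <= j < 6 else 0.0 for j in range(8)] for i in range(8)]
alpha = iterative_m2m(mask, n_iters=2)
```

Each iteration leaves confident interior and background values near 1 and 0 while the object boundary acquires intermediate alpha values, which is the qualitative effect the learned M2M refinement targets.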