Matting Anything
June 8, 2023
Authors: Jiachen Li, Jitesh Jain, Humphrey Shi
cs.AI
Abstract
In this paper, we propose the Matting Anything Model (MAM), an efficient and
versatile framework for estimating the alpha matte of any instance in an image
with flexible and interactive visual or linguistic user prompt guidance. MAM
offers several significant advantages over previous specialized image matting
networks: (i) MAM is capable of dealing with various types of image matting,
including semantic, instance, and referring image matting with only a single
model; (ii) MAM leverages the feature maps from the Segment Anything Model
(SAM) and adopts a lightweight Mask-to-Matte (M2M) module to predict the alpha
matte through iterative refinement with only 2.7 million trainable
parameters; (iii) by incorporating SAM, MAM simplifies the user intervention
required for the interactive use of image matting from the trimap to the box,
point, or text prompt. We evaluate the performance of MAM on various image
matting benchmarks, and the experimental results demonstrate that MAM achieves
comparable performance to the state-of-the-art specialized image matting models
under different metrics on each benchmark. Overall, MAM shows superior
generalization ability and can effectively handle various image matting tasks
with fewer parameters, making it a practical solution for unified image
matting. Our code and models are open-sourced at
https://github.com/SHI-Labs/Matting-Anything.
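
The abstract describes the architecture only at a high level: a frozen SAM backbone whose feature maps feed a small Mask-to-Matte (M2M) head that iteratively refines SAM's coarse mask into an alpha matte. As a rough, hypothetical illustration of that design (not the authors' implementation; all layer names, channel sizes, and the refinement loop below are assumptions), such a head might be wired up as follows:

```python
# Minimal sketch of the pipeline described in the abstract, NOT the official MAM code.
# Assumptions: channel sizes, layer counts, and the residual refinement loop are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class M2M(nn.Module):
    """Lightweight mask-to-matte head: refines a coarse SAM mask into an alpha matte."""

    def __init__(self, feat_ch: int = 256, hidden: int = 32):
        super().__init__()
        # Input: SAM image features concatenated with the RGB image and the current alpha.
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + 3 + 1, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, feats, image, mask_logits, num_iters: int = 3):
        # feats:       (B, feat_ch, h, w) image features from the SAM encoder
        # image:       (B, 3, H, W) input image
        # mask_logits: (B, 1, H, W) coarse mask logits from the SAM decoder (assumed
        #              already upsampled to image resolution)
        alpha = torch.sigmoid(mask_logits)
        feats = F.interpolate(feats, size=image.shape[-2:],
                              mode="bilinear", align_corners=False)
        # Iterative refinement: each pass predicts a residual correction to the matte.
        for _ in range(num_iters):
            x = torch.cat([feats, image, alpha], dim=1)
            alpha = (alpha + self.net(x)).clamp(0.0, 1.0)
        return alpha
```

Consistent with the small trainable-parameter count quoted above, only this head would be trained in such a setup, with the SAM backbone kept frozen and prompted by boxes, points, or text to select the target instance.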