Matting Anything

Abstract

In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image, guided by flexible, interactive visual or linguistic user prompts. MAM offers several significant advantages over previous specialized image matting networks: (i) MAM handles various types of image matting, including semantic, instance, and referring image matting, with a single model; (ii) MAM leverages the feature maps from the Segment Anything Model (SAM) and adopts a lightweight Mask-to-Matte (M2M) module, with only 2.7 million trainable parameters, to predict the alpha matte through iterative refinement; (iii) by incorporating SAM, MAM simplifies the user intervention required for interactive image matting from a trimap to a box, point, or text prompt. We evaluate MAM on various image matting benchmarks, and the experimental results demonstrate that MAM achieves performance comparable to state-of-the-art specialized image matting models under different metrics on each benchmark. Overall, MAM shows superior generalization ability and can effectively handle various image matting tasks with fewer parameters, making it a practical solution for unified image matting. Our code and models are open-sourced at https://github.com/SHI-Labs/Matting-Anything.
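To make the distinction between a segmentation mask and an alpha matte concrete, the following minimal sketch illustrates the standard compositing relation I = alpha * F + (1 - alpha) * B that an alpha matte is defined by, and shows how thresholding a soft matte yields a hard mask. It is not taken from the MAM code base: the function names `composite` and `binarize`, the nested-list image representation, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import List

# Hypothetical representation: a single-channel image as rows of floats in [0, 1].
Image = List[List[float]]


def composite(fg: Image, bg: Image, alpha: Image) -> Image:
    """Blend foreground over background per pixel: I = alpha * F + (1 - alpha) * B."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(fg_row, bg_row, alpha_row)]
        for fg_row, bg_row, alpha_row in zip(fg, bg, alpha)
    ]


def binarize(alpha: Image, threshold: float = 0.5) -> List[List[int]]:
    """Collapse a soft alpha matte into a hard 0/1 mask (threshold is an assumption)."""
    return [[1 if a >= threshold else 0 for a in row] for row in alpha]


# A 1x3 example: fully background, partially transparent, fully foreground.
alpha = [[0.0, 0.25, 1.0]]
fg = [[1.0, 1.0, 1.0]]  # white foreground
bg = [[0.0, 0.0, 0.0]]  # black background

blended = composite(fg, bg, alpha)
hard_mask = binarize(alpha)
```

The middle pixel keeps its fractional opacity in the composite but is forced to background in the hard mask, which is the kind of detail a binary mask cannot carry and a matte can.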