Matting Anything

Abstract

In this paper, we propose the Matting Anything Model (MAM), an efficient and versatile framework for estimating the alpha matte of any instance in an image, guided by flexible and interactive visual or linguistic user prompts. MAM offers several significant advantages over previous specialized image matting networks: (i) MAM handles several types of image matting, including semantic, instance, and referring image matting, with a single model; (ii) MAM leverages the feature maps from the Segment Anything Model (SAM) and adopts a lightweight Mask-to-Matte (M2M) module, which has only 2.7 million trainable parameters, to predict the alpha matte through iterative refinement; (iii) by incorporating SAM, MAM reduces the user intervention required for interactive image matting from a trimap to a box, point, or text prompt. We evaluate MAM on various image matting benchmarks, and the experimental results demonstrate that it achieves performance comparable to state-of-the-art specialized image matting models under different metrics on each benchmark. Overall, MAM shows superior generalization ability and can effectively handle various image matting tasks with fewer parameters, making it a practical solution for unified image matting. Our code and models are open-sourced at https://github.com/SHI-Labs/Matting-Anything.
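For readers unfamiliar with the term, the following minimal sketch illustrates what an alpha matte represents. It assumes the standard compositing model used in the matting literature, I = αF + (1 − α)B, which the abstract does not state explicitly. The function name `composite` and the toy pixel values are hypothetical and are not taken from the MAM code base.

```python
def composite(alpha, foreground, background):
    """Blend per pixel with the assumed model: I = alpha * F + (1 - alpha) * B."""
    return [
        [a * f + (1.0 - a) * b for a, f, b in zip(a_row, f_row, b_row)]
        for a_row, f_row, b_row in zip(alpha, foreground, background)
    ]


# Toy 1x3 grayscale image: pure background, a soft edge pixel, pure foreground.
alpha = [[0.0, 0.5, 1.0]]
foreground = [[200.0, 200.0, 200.0]]
background = [[100.0, 100.0, 100.0]]

result = composite(alpha, foreground, background)
print(result)  # [[100.0, 150.0, 200.0]]
```

The fractional value at the edge pixel is what separates a matte from a binary segmentation mask, which could only assign that pixel fully to the foreground or fully to the background.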