MobileSAMv2: Faster Segment Anything to Everything

Abstract

The Segment Anything Model (SAM) addresses two practical yet challenging segmentation tasks: segment anything (SegAny), which uses a given point to predict the mask for a single object of interest, and segment everything (SegEvery), which predicts the masks for all objects in the image. What makes SegAny slow for SAM is its heavyweight image encoder, a problem that MobileSAM addressed via decoupled knowledge distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in its mask decoder, because it must first generate numerous masks from redundant grid-search prompts and then filter them to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks with only valid prompts, which can be obtained through object discovery. Our approach not only reduces the total time spent in the mask decoder by at least 16 times but also achieves superior performance. Specifically, it yields an average performance boost of 3.6% (42.5% vs. 38.9%) for zero-shot object proposal on the LVIS dataset with the mask AR@K metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmenting things. This project, which targets a faster SegEvery than the original SAM, is termed MobileSAMv2 to differentiate it from MobileSAM, which targets a faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM project: https://github.com/ChaoningZhang/MobileSAM.
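The difference between the two prompt-sampling strategies described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the score threshold, the prompt cap, and the example boxes are all hypothetical, and the detections are assumed to come from some external object-discovery model that is not shown here. The sketch only contrasts how many prompts each strategy would hand to a mask decoder.

```python
import numpy as np


def grid_point_prompts(points_per_side):
    """Uniform grid of (x, y) point prompts in normalized [0, 1] coordinates.

    Every grid cell contributes one prompt, whether or not an object is there,
    so many prompts are redundant and their masks must be filtered afterwards.
    """
    offset = 1.0 / (2 * points_per_side)
    coords = np.linspace(offset, 1.0 - offset, points_per_side)
    xs, ys = np.meshgrid(coords, coords)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)


def box_prompts_from_detections(boxes, scores, score_threshold, max_prompts):
    """Keep confident boxes, highest score first, capped at max_prompts.

    `boxes` are (x0, y0, x1, y1) rows from a hypothetical object-discovery
    step; each surviving box would serve as one prompt for the mask decoder.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    keep = np.flatnonzero(scores >= score_threshold)
    order = keep[np.argsort(-scores[keep])]
    return boxes[order[:max_prompts]]


# Illustrative values only (assumptions, not taken from the paper).
grid = grid_point_prompts(32)
boxes = [[10, 10, 50, 60], [0, 0, 5, 5], [20, 30, 80, 90]]
scores = [0.9, 0.1, 0.6]
prompts = box_prompts_from_detections(
    boxes, scores, score_threshold=0.3, max_prompts=320
)
print(len(grid), "grid prompts vs.", len(prompts), "object-aware prompts")
```

In this toy setting the grid produces a prompt for every cell regardless of image content, while the object-aware path produces one prompt per confident detection, which is the intuition behind the reduced mask-decoder workload claimed in the abstract.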
