MobileSAMv2: Faster Segment Anything to Everything

December 15, 2023
Authors: Chaoning Zhang, Dongshen Han, Sheng Zheng, Jinwoo Choi, Tae-Ho Kim, Choong Seon Hong
cs.AI

Abstract

Segment anything model (SAM) addresses two practical yet challenging segmentation tasks: segment anything (SegAny), which uses a certain point to predict the mask for a single object of interest, and segment everything (SegEvery), which predicts the masks for all objects in the image. What makes SegAny slow for SAM is its heavyweight image encoder, which MobileSAM addresses via decoupled knowledge distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in its mask decoder, because it must first generate numerous masks from redundant grid-search prompts and then filter them to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks with only valid prompts, which can be obtained through object discovery. Our proposed approach not only reduces the total time spent on the mask decoder by at least 16 times but also achieves superior performance. Specifically, our approach yields an average performance boost of 3.6% (42.5% vs. 38.9%) for zero-shot object proposal on the LVIS dataset under the mask AR@K metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmentation. This project, which targets faster SegEvery than the original SAM, is termed MobileSAMv2 to differentiate it from MobileSAM, which targets faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM project: https://github.com/ChaoningZhang/MobileSAM.
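
To make the prompt-sampling idea concrete, below is a minimal sketch using the public segment-anything API, contrasting grid-search prompting with object-aware box prompts. The checkpoint path, image path, and `detected_boxes` are placeholder assumptions for illustration; the actual MobileSAMv2 pipeline (including its object-discovery model and batched decoding) lives in the linked repository.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

# Load an image as RGB (path is a placeholder).
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)

# Load a SAM checkpoint; "vit_b"/"sam_vit_b.pth" stand in for whichever
# image encoder (original or MobileSAM-distilled) is in use.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")

# Baseline SegEvery: a dense 32x32 grid of point prompts (1024 decoder
# queries), most of them redundant and filtered only afterwards.
grid_generator = SamAutomaticMaskGenerator(sam, points_per_side=32)
grid_masks = grid_generator.generate(image)

# MobileSAMv2-style idea: run the mask decoder only on prompts that
# correspond to discovered objects. `detected_boxes` stands in for the
# output of an object-discovery model; the values here are hypothetical.
detected_boxes = [[30, 40, 200, 180], [220, 60, 400, 300]]  # XYXY format

predictor = SamPredictor(sam)
predictor.set_image(image)
object_masks = []
for box in detected_boxes:
    masks, scores, _ = predictor.predict(box=np.array(box), multimask_output=False)
    object_masks.append(masks[0])  # one mask per valid prompt, no filtering pass
```

The point of the sketch: the grid-based generator decodes roughly a thousand prompts and then filters them, whereas the object-aware loop decodes one prompt per discovered object, which is where the reported reduction in mask-decoder time comes from.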

