MobileSAMv2: Faster Segment Anything to Everything
December 15, 2023
Authors: Chaoning Zhang, Dongshen Han, Sheng Zheng, Jinwoo Choi, Tae-Ho Kim, Choong Seon Hong
cs.AI
Abstract
Segment anything model (SAM) addresses two practical yet challenging
segmentation tasks: segment anything (SegAny), which utilizes a
certain point to predict the mask for a single object of interest, and
segment everything (SegEvery), which predicts the masks for all
objects on the image. What makes SegAny slow for SAM is its heavyweight image
encoder, which has been addressed by MobileSAM via decoupled knowledge
distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in
its mask decoder because it needs to first generate numerous masks with
redundant grid-search prompts and then perform filtering to obtain the final
valid masks. We propose to improve its efficiency by directly generating the
final masks with only valid prompts, which can be obtained through object
discovery. Our proposed approach not only helps reduce the total time on the
mask decoder by at least 16 times but also achieves superior performance.
Specifically, our approach yields an average performance boost of 3.6% (42.5%
vs. 38.9%) for zero-shot object proposal on the LVIS dataset with
the mask AR@K metric. Qualitative results show that our approach generates
fine-grained masks while avoiding over-segmenting things. This project
targeting faster SegEvery than the original SAM is termed MobileSAMv2 to
differentiate it from MobileSAM, which targets faster SegAny. Moreover, we
demonstrate that our new prompt sampling is also compatible with the distilled
image encoders in MobileSAM, contributing to a unified framework for efficient
SegAny and SegEvery. The code is available at the MobileSAM project link:
https://github.com/ChaoningZhang/MobileSAM.
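
To make the prompt-sampling idea concrete, below is a minimal sketch (not the authors' implementation) of replacing SAM's dense grid of point prompts with box prompts produced by an object-discovery model, using the segment-anything-style SamPredictor API. The function discover_objects is a hypothetical placeholder for the discovery model, and the checkpoint path is a placeholder as well.

```python
# Sketch of prompt-sampled SegEvery, assuming the segment-anything
# SamPredictor API. `discover_objects` is a hypothetical stand-in for
# the object-discovery model that proposes valid prompts.
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor


def discover_objects(image: np.ndarray) -> np.ndarray:
    """Hypothetical object-discovery step: return an (N, 4) array of
    XYXY boxes, e.g. from a class-agnostic detector."""
    raise NotImplementedError


def seg_everything(predictor: SamPredictor, image: np.ndarray) -> torch.Tensor:
    # Run the heavyweight image encoder once per image.
    predictor.set_image(image)
    # Use the N discovered boxes as prompts instead of a dense grid
    # (e.g. 32 x 32 = 1024 points), so the mask decoder processes only
    # valid prompts and no post-hoc filtering of redundant masks is needed.
    boxes = torch.as_tensor(discover_objects(image), device=predictor.device)
    boxes = predictor.transform.apply_boxes_torch(boxes, image.shape[:2])
    masks, scores, _ = predictor.predict_torch(
        point_coords=None,
        point_labels=None,
        boxes=boxes,
        multimask_output=False,  # one mask per valid prompt
    )
    return masks  # (N, 1, H, W) boolean masks


sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)
```

Box prompts are less ambiguous than single points, which is why one decoder pass with multimask_output=False per discovered object can suffice here; and since the prompt sampling only changes the decoder inputs, the same sketch should work with a distilled MobileSAM-style image encoder loaded in place of vit_h.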