Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
June 25, 2023
作者: Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong
cs.AI
Abstract
Segment anything model (SAM) is a prompt-guided vision foundation model for cutting out the object of interest from its background. Since the Meta research team released the SA project, SAM has attracted significant attention due to its impressive zero-shot transfer performance and its high versatility: it is compatible with other models for advanced vision applications such as image editing with fine-grained control. Many such use cases need to run on resource-constrained edge devices, like mobile apps. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. Naively training such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training resources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, which motivates us to propose decoupled distillation. Concretely, we distill the knowledge from the ViT-H image encoder in the original SAM into a lightweight image encoder, which is automatically compatible with the mask decoder of the original SAM. Training can be completed on a single GPU in less than one day, and the resulting lightweight SAM, termed MobileSAM, is more than 60 times smaller yet performs on par with the original SAM. For inference speed, MobileSAM runs in around 10 ms per image: 8 ms on the image encoder and 2 ms on the mask decoder. With superior performance and higher versatility, our MobileSAM is 7 times smaller and 4 times faster than the concurrent FastSAM, making it more suitable for mobile applications. The code for the MobileSAM project is provided at https://github.com/ChaoningZhang/MobileSAM
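
The decoupled distillation described in the abstract reduces to a simple regression objective: the lightweight student encoder is trained to reproduce the image embeddings of SAM's frozen ViT-H teacher encoder, so the original mask decoder can be reused with the student unchanged. Below is a minimal PyTorch sketch of this idea; the training loop, the `mse_loss` objective, and the hyperparameter names are illustrative assumptions, not the authors' exact training code.

```python
# Minimal sketch of decoupled distillation (assumed details, not the official code).
# Idea: train a lightweight student encoder to regress the image embeddings of
# SAM's frozen ViT-H teacher; the original mask decoder then plugs in unchanged.
import torch
import torch.nn.functional as F

def distill(teacher_encoder, student_encoder, dataloader,
            epochs=8, lr=1e-3, device="cuda"):
    teacher_encoder.to(device).eval()       # frozen teacher: SAM's ViT-H image encoder
    student_encoder.to(device).train()      # lightweight student (e.g., a tiny ViT variant)
    opt = torch.optim.AdamW(student_encoder.parameters(), lr=lr)

    for _ in range(epochs):
        for images in dataloader:           # batches of preprocessed 1024x1024 images
            images = images.to(device)
            with torch.no_grad():
                target = teacher_encoder(images)   # teacher embedding, e.g. (B, 256, 64, 64)
            pred = student_encoder(images)         # student must match the embedding shape
            loss = F.mse_loss(pred, target)        # plain regression on embeddings
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student_encoder
```

Because only the image encoder is trained, the prompt encoder and mask decoder of the original SAM are kept frozen and reused as-is, which is what makes the optimization "decoupled" from the mask-prediction objective.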