Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
June 25, 2023
Authors: Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong
cs.AI
Abstract
The segment anything model (SAM) is a prompt-guided vision foundation model for cutting out the object of interest from its background. Since the Meta research team released the SA project, SAM has attracted significant attention due to its impressive zero-shot transfer performance and its high versatility in being compatible with other models for advanced vision applications such as image editing with fine-grained control. Many such use cases need to run on resource-constrained edge devices, like mobile apps. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. Naively training such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training resources are available. We find that this is mainly caused by the coupled optimization of the image encoder and mask decoder, which motivates us to propose decoupled distillation. Concretely, we distill the knowledge from the ViT-H image encoder in the original SAM into a lightweight image encoder, which is automatically compatible with the mask decoder of the original SAM. Training can be completed on a single GPU in less than one day, and the resulting lightweight SAM, termed MobileSAM, is more than 60 times smaller yet performs on par with the original SAM. For inference speed, MobileSAM runs at around 10 ms per image: 8 ms for the image encoder and 2 ms for the mask decoder. With superior performance and higher versatility, our MobileSAM is 7 times smaller and 4 times faster than the concurrent FastSAM, making it more suitable for mobile applications. The code for the MobileSAM project is provided at https://github.com/ChaoningZhang/MobileSAM
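The decoupled distillation described in the abstract can be summarized as a small embedding-regression loop. Below is a minimal PyTorch sketch, not the project's actual code: the names `teacher`, `student`, and `distill_encoder` are hypothetical stand-ins for SAM's frozen ViT-H image encoder, a lightweight replacement encoder, and the training routine, and the data pipeline is assumed to yield batches of preprocessed image tensors.

```python
# Minimal sketch of decoupled distillation, assuming the student encoder
# emits embeddings of the same shape as SAM's ViT-H image encoder, so the
# original mask decoder remains compatible without retraining.
import torch
import torch.nn as nn


def distill_encoder(teacher: nn.Module, student: nn.Module,
                    dataloader, epochs: int = 1, lr: float = 1e-3) -> nn.Module:
    """Regress the student's image embeddings onto the frozen teacher's."""
    teacher.eval()  # the ViT-H teacher stays frozen throughout
    for p in teacher.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    mse = nn.MSELoss()

    for _ in range(epochs):
        for images in dataloader:  # batches of image tensors
            with torch.no_grad():
                target = teacher(images)  # teacher embedding, e.g. (B, 256, 64, 64)
            pred = student(images)        # student must match the embedding shape
            loss = mse(pred, target)      # decoupled: the mask decoder is not in the loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```

In this sketch the mask decoder never enters the loss, so only raw images are required rather than mask annotations or sampled prompts; this decoupling is what the abstract credits for making training feasible on a single GPU in under a day.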