Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

Abstract

The Segment Anything Model (SAM) is a prompt-guided vision foundation model for cutting out an object of interest from its background. Since the Meta research team released the SA project, SAM has attracted significant attention for its impressive zero-shot transfer performance and for its versatility: it is compatible with other models in advanced vision applications such as image editing with fine-grained control. Many of these use cases need to run on resource-constrained edge devices, such as mobile apps. In this work, we aim to make SAM mobile-friendly by replacing the heavyweight image encoder with a lightweight one. Naively training such a new SAM as in the original SAM paper leads to unsatisfactory performance, especially when limited training sources are available. We find that this is mainly caused by the coupled optimization of the image encoder and the mask decoder, which motivates us to propose decoupled distillation. Concretely, we distill the knowledge from the ViT-H image encoder in the original SAM into a lightweight image encoder, which is then automatically compatible with the mask decoder of the original SAM. Training can be completed on a single GPU in less than one day. The resulting lightweight SAM, termed MobileSAM, is more than 60 times smaller yet performs on par with the original SAM. In terms of inference speed, MobileSAM takes around 10 ms per image: 8 ms for the image encoder and 2 ms for the mask decoder. With superior performance and higher versatility, MobileSAM is 7 times smaller and 4 times faster than the concurrent FastSAM, making it more suitable for mobile applications. The code for the MobileSAM project is available at https://github.com/ChaoningZhang/MobileSAM
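
To make the idea of decoupled distillation concrete, the following is a minimal toy sketch in plain Python, not the MobileSAM training code. It assumes a mean-squared-error objective between teacher and student embeddings, which the abstract does not specify. The linear "encoders", the thresholding `frozen_decoder`, and all names and hyperparameters are hypothetical stand-ins chosen for illustration. Only the student encoder is updated. The teacher and the decoder stay frozen, so the decoder can be reused on the student's embeddings without retraining.

```python
import random


def matvec(weights, x):
    """Apply a linear map (given as a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def distill(teacher_w, images, epochs=200, lr=0.05):
    """Fit a student linear encoder to the frozen teacher's embeddings."""
    emb_dim, in_dim = len(teacher_w), len(teacher_w[0])
    student_w = [[0.0] * in_dim for _ in range(emb_dim)]
    # Teacher embeddings are computed once; the teacher is never updated.
    targets = [matvec(teacher_w, x) for x in images]
    for _ in range(epochs):
        for x, target in zip(images, targets):
            pred = matvec(student_w, x)
            for i in range(emb_dim):
                # Gradient of the MSE with respect to row i of the student.
                grad = 2.0 * (pred[i] - target[i]) / emb_dim
                for j in range(in_dim):
                    student_w[i][j] -= lr * grad * x[j]
    return student_w


def frozen_decoder(embedding):
    """Hypothetical stand-in for the mask decoder: thresholds the embedding."""
    return [1 if e > 0 else 0 for e in embedding]


random.seed(0)
# Toy "images" are 4-dimensional vectors; embeddings are 3-dimensional.
teacher_w = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
images = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(32)]

student_w = distill(teacher_w, images)

# Average embedding mismatch between student and teacher after distillation.
final_loss = sum(
    mse(matvec(student_w, x), matvec(teacher_w, x)) for x in images
) / len(images)

# Fraction of inputs on which the untouched decoder gives identical outputs.
agreement = sum(
    frozen_decoder(matvec(student_w, x)) == frozen_decoder(matvec(teacher_w, x))
    for x in images
) / len(images)

print(final_loss, agreement)
```

In this sketch the decoder never appears in the training loop, which is the sense in which the optimization is decoupled: the student only has to reproduce the teacher's embeddings, and compatibility with the existing decoder follows from that match.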
