MobileSAMv2: Faster Segment Anything to Everything

Abstract

Segment Anything Model (SAM) addresses two practical yet challenging segmentation tasks: segment anything (SegAny), which uses a given point to predict the mask for a single object of interest, and segment everything (SegEvery), which predicts the masks for all objects in the image. SegAny is slow in SAM because of its heavyweight image encoder, a problem MobileSAM addressed via decoupled knowledge distillation. The efficiency bottleneck of SegEvery with SAM, however, lies in its mask decoder, which must first generate numerous masks from redundant grid-search prompts and then filter them to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks with only valid prompts, which can be obtained through object discovery. Our proposed approach not only reduces the total time spent in the mask decoder by at least 16 times but also achieves superior performance. Specifically, it yields an average performance boost of 3.6% (42.5% vs. 38.9%) for zero-shot object proposal on the LVIS dataset with the mask AR@K metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmentation. This project, which targets faster SegEvery than the original SAM, is termed MobileSAMv2 to differentiate it from MobileSAM, which targets faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM project: https://github.com/ChaoningZhang/MobileSAM.
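To make the prompt-count argument concrete, the following is a minimal sketch, not the authors' implementation. It contrasts the number of prompts a mask decoder would receive under uniform grid search with the number it would receive when only prompts from an object-discovery step are kept. The grid size of 32 points per side, the 1024x1024 image, the scored bounding boxes, and the confidence threshold are all illustrative assumptions; the abstract does not specify them.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def grid_point_prompts(points_per_side: int, width: int, height: int) -> List[Point]:
    """Uniform grid of point prompts, one at the centre of each grid cell."""
    step_x = width / points_per_side
    step_y = height / points_per_side
    return [
        ((col + 0.5) * step_x, (row + 0.5) * step_y)
        for row in range(points_per_side)
        for col in range(points_per_side)
    ]


def object_aware_prompts(
    boxes: List[Box], scores: List[float], min_score: float
) -> List[Box]:
    """Keep only confident detections; each surviving box becomes one prompt."""
    return [box for box, score in zip(boxes, scores) if score >= min_score]


# Hypothetical output of an object-discovery step (values are made up).
detected_boxes: List[Box] = [
    (10, 20, 200, 300),
    (400, 420, 900, 1000),
    (50, 600, 120, 700),
]
detected_scores = [0.92, 0.81, 0.15]

grid = grid_point_prompts(points_per_side=32, width=1024, height=1024)
valid = object_aware_prompts(detected_boxes, detected_scores, min_score=0.3)

print(len(grid))   # 1024 prompts go to the decoder under grid search
print(len(valid))  # 2 prompts go to the decoder under object-aware sampling
```

In this toy setting the decoder runs on 2 prompts instead of 1024, and no post-hoc filtering of redundant masks is needed. The actual speedup depends on how many objects the discovery step returns for a given image.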
