MobileSAMv2: Faster Segment Anything to Everything

Abstract

The Segment Anything Model (SAM) addresses two practical yet challenging segmentation tasks: segment anything (SegAny), which uses a given point to predict the mask of a single object of interest, and segment everything (SegEvery), which predicts masks for all objects in the image. SegAny is slow in SAM because of its heavyweight image encoder, a problem that MobileSAM addressed via decoupled knowledge distillation. The efficiency bottleneck of SegEvery in SAM, however, lies in its mask decoder, which must first generate numerous masks from redundant grid-search prompts and then filter them to obtain the final valid masks. We propose to improve its efficiency by directly generating the final masks from only valid prompts, which can be obtained through object discovery. Our approach not only reduces the total time spent in the mask decoder by at least 16 times but also achieves superior performance. Specifically, it yields an average gain of 3.6 percentage points (42.5% vs. 38.9%) on zero-shot object proposal on the LVIS dataset, measured with the mask AR@K metric. Qualitative results show that our approach generates fine-grained masks while avoiding over-segmentation. This project, which targets a faster SegEvery than the original SAM, is termed MobileSAMv2 to distinguish it from MobileSAM, which targets a faster SegAny. Moreover, we demonstrate that our new prompt sampling is also compatible with the distilled image encoders in MobileSAM, contributing to a unified framework for efficient SegAny and SegEvery. The code is available at the same link as the MobileSAM project (https://github.com/ChaoningZhang/MobileSAM).
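The contrast the abstract draws between grid-search prompting and prompting only from discovered objects can be illustrated with a toy count of mask-decoder calls. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Box` type, the stand-in "decoder" (a point simply selects the first box that contains it), the 32-points-per-side grid, and the use of boxes as the object-discovery output are all hypothetical choices made for illustration. The resulting ratio is a property of this toy setup and does not reproduce the reported 16x figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box standing in for one object (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


def grid_point_prompts(width, height, points_per_side):
    """Uniform grid of point prompts placed at cell centers."""
    step_x = width / points_per_side
    step_y = height / points_per_side
    return [((i + 0.5) * step_x, (j + 0.5) * step_y)
            for j in range(points_per_side)
            for i in range(points_per_side)]


def grid_search_everything(objects, width, height, points_per_side):
    """Decode one candidate per grid point, then filter out duplicates.

    Stand-in decoder (assumption): a point "segments" the first object
    whose box contains it; background points yield nothing.
    Returns (decoder_calls, indices_of_kept_objects).
    """
    calls = 0
    kept = set()
    for x, y in grid_point_prompts(width, height, points_per_side):
        calls += 1  # every grid point costs one decoder call
        hit = next((k for k, b in enumerate(objects) if b.contains(x, y)), None)
        if hit is not None:
            kept.add(hit)  # filtering step: redundant candidates collapse
    return calls, sorted(kept)


def object_aware_everything(detected_boxes):
    """One prompt per discovered object, so no post-hoc filtering is needed.

    `detected_boxes` is assumed to come from some object-discovery step.
    """
    return len(detected_boxes), list(range(len(detected_boxes)))


# Toy scene: three objects in a 640x480 image.
objects = [Box(10, 10, 100, 100), Box(200, 50, 300, 220), Box(400, 300, 620, 460)]
grid_calls, grid_kept = grid_search_everything(objects, 640, 480, points_per_side=32)
aware_calls, aware_kept = object_aware_everything(objects)
print(f"grid search:  {grid_calls} decoder calls, {len(grid_kept)} masks kept")
print(f"object-aware: {aware_calls} decoder calls, {len(aware_kept)} masks kept")
```

In this toy scene both strategies recover the same three objects, but the grid spends a decoder call on every grid point, including background points and repeated hits on the same object, while the object-aware path spends one call per discovered object.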
