Segment Anything in High Quality

June 2, 2023
作者: Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
cs.AI

Abstract

The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44K masks, which takes only 4 hours on 8 GPUs. We show the efficacy of HQ-SAM in a suite of 9 diverse segmentation datasets across different downstream tasks, seven of which are evaluated in a zero-shot transfer protocol. Our code and models will be released at https://github.com/SysCV/SAM-HQ.
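The two ideas in the abstract — a learnable output token injected into the mask decoder, and fusing decoder features with early and final ViT features before mask prediction — can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the projection-and-sum fusion, and the two-layer MLP head are all illustrative assumptions, and random arrays stand in for real network features.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2, b2):
    # Simple two-layer MLP with ReLU, standing in for the token's output head.
    return np.maximum(x @ w1 + b1, 0) @ w2 + b2

# Hypothetical dimensions (not from the paper).
C = 32       # channel dimension
H = W = 16   # spatial resolution of the fused feature map

# Feature maps from three sources, flattened to (H*W, C).
# In HQ-SAM these come from the frozen SAM backbone and decoder;
# here they are random placeholders.
early_vit  = rng.standard_normal((H * W, C))   # early ViT features (fine detail)
final_vit  = rng.standard_normal((H * W, C))   # final ViT features (semantics)
decoder_ft = rng.standard_normal((H * W, C))   # SAM mask-decoder features

# Fuse by summing per-source linear projections (one plausible fusion choice).
proj = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
fused = early_vit @ proj[0] + final_vit @ proj[1] + decoder_ft @ proj[2]

# Learnable High-Quality Output Token, shape (1, C). In the real decoder it is
# updated by attention alongside SAM's existing output tokens; here it is
# passed straight through an MLP to form a per-pixel mask classifier.
hq_token = rng.standard_normal((1, C))
w1, b1 = rng.standard_normal((C, C)) * 0.1, np.zeros(C)
w2, b2 = rng.standard_normal((C, C)) * 0.1, np.zeros(C)
hq_query = mlp(hq_token, w1, b1, w2, b2)

# High-quality mask logits: dot product of the token query with fused features.
mask_logits = (fused @ hq_query.T).reshape(H, W)
print(mask_logits.shape)  # (16, 16)
```

The design point the abstract emphasizes is that only the token and these small fusion/head layers are trained; the pre-trained SAM weights that produce the three feature maps stay frozen, which is why training fits in 4 hours on 8 GPUs.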