Segment Anything in High Quality
June 2, 2023
Authors: Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu
cs.AI
Abstract
The recent Segment Anything Model (SAM) represents a big leap in scaling up
segmentation models, allowing for powerful zero-shot capabilities and flexible
prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction
quality falls short in many cases, particularly when dealing with objects that
have intricate structures. We propose HQ-SAM, equipping SAM with the ability to
accurately segment any object, while maintaining SAM's original promptable
design, efficiency, and zero-shot generalizability. Our careful design reuses
and preserves the pre-trained model weights of SAM, while only introducing
minimal additional parameters and computation. We design a learnable
High-Quality Output Token, which is injected into SAM's mask decoder and is
responsible for predicting the high-quality mask. Instead of applying it only
to mask-decoder features, we first fuse them with early and final ViT features
for improved mask details. To train our introduced learnable parameters, we
compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is
trained only on the introduced dataset of 44K masks, which takes only 4 hours
on 8 GPUs. We show the efficacy of HQ-SAM in a suite of 9 diverse segmentation
datasets across different downstream tasks, 7 of which are evaluated in a
zero-shot transfer protocol. Our code and models will be released at
https://github.com/SysCV/SAM-HQ.
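The core idea in the abstract — a learnable High-Quality Output Token whose mask is predicted from mask-decoder features fused with early and final ViT features — can be illustrated with a minimal NumPy sketch. All shapes, names, and the summation-based fusion below are illustrative assumptions, not the released HQ-SAM implementation (which uses learned convolutional fusion and a transformer decoder).

```python
import numpy as np

# Hypothetical shapes: C-channel feature maps at 64x64 resolution.
C, H, W = 32, 64, 64
rng = np.random.default_rng(0)

early_vit = rng.standard_normal((C, H, W))     # early ViT layer: fine local detail
final_vit = rng.standard_normal((C, H, W))     # final ViT layer: global context
decoder_feat = rng.standard_normal((C, H, W))  # SAM mask-decoder feature

# Fuse the three sources; plain summation stands in for the paper's
# learned fusion of mask-decoder and ViT features.
hq_features = early_vit + final_vit + decoder_feat

# The learnable HQ Output Token would be updated inside SAM's mask decoder;
# here it is a fixed random vector. As in SAM-style mask prediction, the
# mask logits are the token's dot product with the per-pixel fused features.
hq_token = rng.standard_normal(C)
mask_logits = np.einsum("c,chw->hw", hq_token, hq_features)
hq_mask = mask_logits > 0  # binary high-quality mask

print(hq_mask.shape)  # (64, 64)
```

Because the token only adds one C-dimensional vector (plus the fusion layers in the real model), this design matches the abstract's claim of minimal additional parameters on top of the frozen SAM weights.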