Segment and Caption Anything

Abstract

We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM generalizes strongly to segmenting anything but falls short in semantic understanding. By introducing a lightweight query-based feature mixer, we align region-specific features with the embedding space of language models for later caption generation. Because the number of trainable parameters is small (typically on the order of tens of millions), training requires less computation, memory, and communication bandwidth, making it both fast and scalable. To address the scarcity of regional caption data, we propose first pre-training our model on object detection and segmentation tasks. We call this step weak supervision pretraining, since the pre-training data contains only category names rather than full-sentence descriptions. Weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on efficient ways to augment SAM with regional semantics. The project page, along with the associated code, is available at https://xk-huang.github.io/segment-caption-anything/.
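
As an illustration of the weak supervision idea, the following is a minimal sketch of how detection annotations could be turned into region-text training pairs whose target text is just the category name. It assumes COCO-style annotation dictionaries (keys `image_id`, `bbox`, `category_id`); the names `RegionTextPair` and `detection_to_weak_pairs` are hypothetical and are not taken from the project's code, whose actual data pipeline may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumption: boxes are (x, y, width, height), as in COCO-style annotations.
Box = Tuple[float, float, float, float]


@dataclass
class RegionTextPair:
    """One weakly supervised sample: a region prompt and its target text."""
    image_id: int
    box: Box
    text: str  # a category name, not a full-sentence description


def detection_to_weak_pairs(
    annotations: List[dict], categories: Dict[int, str]
) -> List[RegionTextPair]:
    """Map each detection annotation to a (region, category name) pair.

    Annotations whose category id is missing from `categories` are skipped.
    """
    pairs = []
    for ann in annotations:
        name = categories.get(ann["category_id"])
        if name is None:
            continue
        pairs.append(RegionTextPair(ann["image_id"], tuple(ann["bbox"]), name))
    return pairs


# Hypothetical toy data for illustration only.
categories = {1: "dog", 2: "bicycle"}
annotations = [
    {"image_id": 7, "bbox": [10.0, 20.0, 50.0, 40.0], "category_id": 1},
    {"image_id": 7, "bbox": [0.0, 0.0, 30.0, 30.0], "category_id": 99},
]
pairs = detection_to_weak_pairs(annotations, categories)
```

In this sketch the category name plays the role that a full caption would play in regional captioning data, which is why the supervision is described as weak.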
