Segment and Caption Anything

Abstract

We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM generalizes strongly to segmenting anything but falls short in semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. Because the number of trainable parameters is small (typically on the order of tens of millions), training requires less computation, less memory, and less communication bandwidth, making it both fast and scalable. To address the scarcity of regional caption data, we propose to first pre-train our model on object detection and segmentation tasks. We call this step weak supervision pretraining, since the pre-training data contains only category names instead of full-sentence descriptions. Weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed at https://xk-huang.github.io/segment-caption-anything/.
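To make the idea of a query-based feature mixer more concrete, here is a minimal sketch in plain Python. It assumes the mixer can be approximated by a single-head cross-attention step: a small set of query vectors reads from the region features, and the result is projected into the language model's embedding dimension. This is an illustration only and not the authors' architecture. The function name `mix_region_features`, the random stand-in weights, and all sizes are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: small random values (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def mix_region_features(queries, region_feats, w_out):
    """Single-head cross-attention: queries read from region features.

    queries:      num_queries x feat_dim (trainable in a real model)
    region_feats: num_tokens  x feat_dim (produced by the segmenter)
    w_out:        feat_dim    x lm_dim   (projection into the LM embedding space)
    Returns num_queries x lm_dim vectors that a caption decoder could consume.
    """
    scale = math.sqrt(len(queries[0]))
    keys_t = [list(col) for col in zip(*region_feats)]  # feat_dim x num_tokens
    scores = matmul(queries, keys_t)                     # num_queries x num_tokens
    attn = [softmax([s / scale for s in row]) for row in scores]
    mixed = matmul(attn, region_feats)                   # num_queries x feat_dim
    return matmul(mixed, w_out)                          # num_queries x lm_dim


rng = random.Random(0)
NUM_QUERIES, NUM_TOKENS, FEAT_DIM, LM_DIM = 4, 6, 8, 16  # illustrative sizes only
queries = random_matrix(NUM_QUERIES, FEAT_DIM, rng)
region_feats = random_matrix(NUM_TOKENS, FEAT_DIM, rng)
w_out = random_matrix(FEAT_DIM, LM_DIM, rng)

prompt = mix_region_features(queries, region_feats, w_out)
print(len(prompt), len(prompt[0]))
```

In this sketch only `queries` and `w_out` would be trained, which mirrors the abstract's point that the trainable part is small compared with the segmenter and the language model.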
