
Improving fine-grained understanding in image-text pre-training

January 18, 2024
作者: Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, Jovana Mitrović
cs.AI

Abstract

We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text embeddings to learn representations that simultaneously encode global and local information. We thoroughly evaluate our proposed method and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g. classification, as well as region-level tasks relying on fine-grained information, e.g. retrieval, object detection, and segmentation. Moreover, SPARC improves model faithfulness and captioning in foundational vision-language models.
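Below is a minimal sketch of the fine-grained SPARC loss as described in the abstract: a sparse patch-token similarity produces per-token alignment weights, each token gets a language-grouped vision embedding as a weighted average of patches, and the two are contrasted within a single sample. The function name `sparc_fine_grained_loss`, the min-max normalisation, the `1/P` sparsity threshold, the temperature `tau`, and the symmetric cross-entropy are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of SPARC's fine-grained sequence-wise loss.
# Sparsification rule, threshold, and temperature are assumptions.
import torch
import torch.nn.functional as F

def sparc_fine_grained_loss(patch_emb, token_emb, tau=0.07):
    """patch_emb: (P, D) image patch embeddings for one sample.
    token_emb: (T, D) caption token embeddings for the same sample."""
    P = patch_emb.shape[0]
    # Patch-token similarity matrix, shape (T, P).
    sim = token_emb @ patch_emb.T
    # Min-max normalise each token's row to [0, 1] (assumed scheme).
    sim_min = sim.min(dim=-1, keepdim=True).values
    sim_max = sim.max(dim=-1, keepdim=True).values
    align = (sim - sim_min) / (sim_max - sim_min + 1e-8)
    # Sparsify: drop alignments below a hypothetical 1/P threshold.
    align = torch.where(align >= 1.0 / P, align, torch.zeros_like(align))
    # Renormalise so each token's surviving weights sum to 1.
    weights = align / (align.sum(dim=-1, keepdim=True) + 1e-8)
    # Language-grouped vision embedding: weighted average of patches per token.
    grouped = weights @ patch_emb  # (T, D)
    # Sequence-wise contrastive loss within the sample: each token should
    # match its own grouped vision embedding; other tokens act as negatives,
    # so no other batch samples are needed.
    grouped = F.normalize(grouped, dim=-1)
    tokens = F.normalize(token_emb, dim=-1)
    logits = tokens @ grouped.T / tau  # (T, T)
    targets = torch.arange(tokens.shape[0])
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In a full training setup, this per-sample loss would be averaged over the batch and added to a standard global image-text contrastive loss (as in CLIP) over pooled embeddings, matching the abstract's combination of local and global objectives.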