

COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?

May 5, 2023
作者: Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko
cs.AI

Abstract

Compositional reasoning is a hallmark of human visual intelligence; yet despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. Using Cola as a testbed, we explore modeling designs to adapt pre-trained vision-language models to reason compositionally about multiple attributes attached to multiple objects. We explore 6 finetuning strategies on 2 seminal vision-language models, using 3 finetuning datasets and 2 test benchmarks (Cola and CREPE). Surprisingly, our optimal finetuning strategy improves a 151M parameter CLIP, which disjointly encodes image and language during pretraining, to perform as well as a 241M parameter FLAVA, which uses a multi-modal transformer encoder during pretraining to attend over both vision and language modalities. This optimal finetuning strategy is a lightweight multi-modal adapter that jointly attends over both image and language features generated by the pretrained model. We show this works better than common strategies such as prompt/fine-tuning, or tuning a comparable number of unimodal layers.
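The winning adapter described above can be pictured as a small transformer block that takes the frozen pretrained model's image and text token features and lets them attend to each other. The sketch below is only an illustration of that idea, not the authors' released code: the feature dimension, layer count, and scoring head are assumed placeholders chosen for a CLIP-like 512-dimensional embedding space.

```python
# A minimal sketch (assumed, not the paper's exact configuration) of a
# lightweight multi-modal adapter: a small transformer that jointly
# attends over frozen image and text features from a pretrained dual
# encoder such as CLIP, then scores how well the caption matches the image.
import torch
import torch.nn as nn


class MultiModalAdapter(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim,
            nhead=num_heads,
            dim_feedforward=4 * dim,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.score = nn.Linear(dim, 1)  # image-text matching score

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N_img, dim), text_tokens: (B, N_txt, dim),
        # both produced by the frozen pretrained encoders.
        joint = torch.cat([image_tokens, text_tokens], dim=1)  # (B, N_img + N_txt, dim)
        joint = self.encoder(joint)  # self-attention mixes the two modalities
        return self.score(joint.mean(dim=1)).squeeze(-1)  # (B,) matching scores


if __name__ == "__main__":
    adapter = MultiModalAdapter()
    img = torch.randn(4, 50, 512)   # e.g. CLIP ViT patch features projected to 512-d
    txt = torch.randn(4, 16, 512)   # e.g. CLIP text token features
    print(adapter(img, txt).shape)  # torch.Size([4])
```

Only the adapter is trained; the pretrained encoders stay frozen, which is what makes this strategy lightweight compared with fine-tuning a comparable number of unimodal layers.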