

COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?

May 5, 2023
Authors: Arijit Ray, Filip Radenovic, Abhimanyu Dubey, Bryan A. Plummer, Ranjay Krishna, Kate Saenko
cs.AI

Abstract

Compositional reasoning is a hallmark of human visual intelligence; yet despite the size of large vision-language models, they struggle to represent simple compositions by combining objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. Using Cola as a testbed, we explore modeling designs to adapt pre-trained vision-language models to reason compositionally about multiple attributes attached to multiple objects. We explore 6 finetuning strategies on 2 seminal vision-language models, using 3 finetuning datasets and 2 test benchmarks (Cola and CREPE). Surprisingly, our optimal finetuning strategy improves a 151M parameter CLIP, which disjointly encodes image and language during pretraining, to perform as well as a 241M parameter FLAVA, which uses a multi-modal transformer encoder during pretraining to attend over both vision and language modalities. This optimal finetuning strategy is a lightweight multi-modal adapter that jointly attends over both image and language features generated by the pretrained model. We show this works better than common strategies such as prompt/fine-tuning, or tuning a comparable number of unimodal layers.
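To make the winning design concrete, below is a minimal, hypothetical sketch of a lightweight multi-modal adapter in PyTorch: a small cross-attention block that jointly attends over text-token and image-patch features produced by a frozen pretrained backbone and scores an image-caption pair for retrieval. The layer sizes, the single-block depth, and the pooling/scoring head are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Hypothetical lightweight adapter: text tokens cross-attend to image
    patches from a frozen vision-language backbone (dimensions assumed)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention lets language features attend over image features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        # Scores the adapted representation for text-to-image retrieval.
        self.score = nn.Linear(dim, 1)

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:  (batch, n_text_tokens, dim) from the frozen text encoder
        # image_feats: (batch, n_patches, dim)     from the frozen image encoder
        attended, _ = self.cross_attn(query=text_feats, key=image_feats, value=image_feats)
        x = self.norm1(text_feats + attended)   # residual + norm after cross-attention
        x = self.norm2(x + self.mlp(x))         # residual + norm after feed-forward
        # Pool over text tokens and produce a single image-caption match score.
        return self.score(x.mean(dim=1)).squeeze(-1)


# Usage sketch: only the adapter is trained; the pretrained encoders stay frozen.
adapter = MultimodalAdapter(dim=512)
text_feats = torch.randn(4, 16, 512)   # stand-in for frozen text-encoder outputs
image_feats = torch.randn(4, 49, 512)  # stand-in for frozen image-encoder outputs
scores = adapter(text_feats, image_feats)
print(scores.shape)  # torch.Size([4])
```

The point of this design, as the abstract argues, is that joint attention over both modalities is added only at the adapter stage, which is enough to close the gap with a model that attended over both modalities during pretraining, while tuning far fewer parameters than prompt/fine-tuning or unimodal-layer tuning of comparable size.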