
Retrieval-Enhanced Contrastive Vision-Text Models

June 12, 2023
Authors: Ahmet Iscen, Mathilde Caron, Alireza Fathi, Cordelia Schmid
cs.AI

Abstract

Contrastive image-text models such as CLIP form the building blocks of many state-of-the-art systems. While they excel at recognizing common generic concepts, they still struggle on fine-grained entities which are rare, or even absent from the pre-training dataset. Hence, a key ingredient to their success has been the use of large-scale curated pre-training data aiming at expanding the set of concepts that they can memorize during the pre-training stage. In this work, we explore an alternative to encoding fine-grained knowledge directly into the model's parameters: we instead train the model to retrieve this knowledge from an external memory. Specifically, we propose to equip existing vision-text models with the ability to refine their embeddings at inference time with cross-modal information retrieved from a memory, which greatly improves their zero-shot predictions. Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP. Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks: for example, +10.9 on Stanford Cars, +10.2 on CUB-2011, and +7.3 on the recent OVEN benchmark.
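To make the described mechanism concrete, below is a minimal PyTorch sketch of retrieval-enhanced embedding refinement: a frozen CLIP embedding retrieves its k nearest neighbors from a memory of paired image-text embeddings, and a single-layer fusion transformer combines them into a refined embedding. This is an illustrative reading of the abstract, not the authors' released implementation; the names (`RetrievalFusion`, `mem_keys`, `mem_values`), the choice of k, and the residual combination at the end are all assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalFusion(nn.Module):
    """Single-layer fusion transformer over frozen CLIP embeddings (sketch)."""

    def __init__(self, dim: int = 512, k: int = 8):
        super().__init__()
        self.k = k
        # The lightweight fusion module: one transformer encoder layer.
        self.fusion = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=4 * dim, batch_first=True
        )

    def forward(self, query: torch.Tensor, mem_keys: torch.Tensor,
                mem_values: torch.Tensor) -> torch.Tensor:
        # query:      (B, D) frozen CLIP embeddings, L2-normalized
        # mem_keys:   (M, D) memory embeddings in the query's modality
        # mem_values: (M, D) paired cross-modal memory embeddings
        sims = query @ mem_keys.t()                  # cosine similarity, (B, M)
        idx = sims.topk(self.k, dim=-1).indices      # k nearest neighbors, (B, k)
        retrieved = mem_values[idx]                  # cross-modal neighbors, (B, k, D)
        # Concatenate the query with its retrieved neighbors and run the
        # fusion layer; the output at the query position is the refined embedding.
        tokens = torch.cat([query.unsqueeze(1), retrieved], dim=1)
        refined = self.fusion(tokens)[:, 0]
        # Combining refined and original embeddings by residual addition
        # followed by renormalization is an assumption of this sketch.
        return F.normalize(query + refined, dim=-1)

# Hypothetical usage: refine image embeddings with retrieved text embeddings.
image_emb = F.normalize(torch.randn(4, 512), dim=-1)
memory_img = F.normalize(torch.randn(10_000, 512), dim=-1)
memory_txt = F.normalize(torch.randn(10_000, 512), dim=-1)
reco = RetrievalFusion()
refined = reco(image_emb, memory_img, memory_txt)  # (4, 512)
```

Since the CLIP backbone stays frozen, only the single fusion layer is trained, which keeps the retrieval-enhanced model cheap to fit and leaves the original zero-shot embeddings intact when no memory is used.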