GeneCIS: A Benchmark for General Conditional Image Similarity

Abstract

We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets. We find that our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. Project page at https://sgvaze.github.io/genecis/.
