More Context, Less Distraction: Visual Classification by Inferring and Conditioning on Contextual Attributes
August 2, 2023
Authors: Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, Furong Huang
cs.AI
Abstract
CLIP, as a foundational vision-language model, is widely used in zero-shot image classification due to its ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better zero-shot classification is still an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that in classifying an object, humans first infer its class-independent attributes (e.g., background and orientation), which help separate the foreground object from the background, and then make decisions based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer these attributes from an image. Based on these observations, we propose a training-free, two-step zero-shot classification method named PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioned on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst-group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
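
To make the two-step procedure concrete, below is a minimal sketch of the idea built on the open-source `clip` package (github.com/openai/CLIP): first score a small set of contextual-attribute prompts against the image to infer an attribute distribution, then classify with attribute-conditioned prompts, averaging the class probabilities under that distribution. The class names, background values, prompt templates, and image path are illustrative assumptions for a Waterbirds-style setting; this is not the authors' reference implementation.

```python
# A minimal sketch of the two-step idea, not the authors' reference code.
# Assumes the open-source `clip` package (github.com/openai/CLIP), torch, and Pillow.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative class names and contextual-attribute values (Waterbirds-style).
classes = ["landbird", "waterbird"]
backgrounds = ["on land", "on water"]


@torch.no_grad()
def prompt_logits(image_features, prompts):
    """Scaled cosine similarities between one image and a list of text prompts."""
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (model.logit_scale.exp() * image_features @ text_features.T).squeeze(0)


@torch.no_grad()
def classify(image_path):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    # Step 1: infer the contextual attribute (here: background) from the image.
    attr_probs = prompt_logits(
        image_features, [f"a photo of a bird {bg}" for bg in backgrounds]
    ).softmax(dim=-1)

    # Step 2: classify with attribute-conditioned prompts, averaging the class
    # posteriors under the inferred attribute distribution.
    class_probs = torch.zeros(len(classes), device=device)
    for bg, p_bg in zip(backgrounds, attr_probs):
        cond = [f"a photo of a {c} {bg}" for c in classes]
        class_probs += p_bg * prompt_logits(image_features, cond).softmax(dim=-1)

    return classes[class_probs.argmax().item()]


print(classify("example_bird.jpg"))  # hypothetical input image path
```

Conditioning the class prompts on the inferred background is what discourages the model from using the background itself as the deciding (spurious) feature, which is why the abstract reports the largest gains on worst-group accuracy.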