More Context, Less Distraction: Visual Classification by Inferring and Conditioning on Contextual Attributes

Abstract

CLIP, a foundational vision-language model, is widely used for zero-shot image classification because it can understand a wide range of visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better zero-shot classification remains an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that, when classifying an object, humans first infer its class-independent attributes (e.g., background and orientation), which help separate the foreground object from the background, and then make decisions based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer these attributes from an image. Building on these observations, we propose PerceptionCLIP, a training-free, two-step zero-shot classification method. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioned on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst-group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
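The two-step procedure described in the abstract (infer a contextual attribute, then classify conditioned on it) can be illustrated with a minimal sketch. This is one plausible reading of the idea, not the authors' implementation: the function `two_step_classify`, the score table `sim`, the prompt wording, and all numeric values are hypothetical, and no real CLIP model is called. In practice the scores would come from image-text similarities of a vision-language model.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    z = logsumexp(values)
    return [math.exp(v - z) for v in values]


def two_step_classify(sim, classes, attributes):
    """Hypothetical two-step zero-shot classifier.

    sim[(cls, attr)] is an assumed image-text similarity score for a prompt
    such as "a photo of a {cls} {attr}" (the template is illustrative only).
    """
    # Step 1: infer the contextual attribute. Each attribute is scored by
    # pooling its evidence over all classes, so the estimate is class-agnostic.
    attr_scores = [logsumexp([sim[(c, a)] for c in classes]) for a in attributes]
    attr_probs = softmax(attr_scores)

    # Step 2: classify conditioned on each attribute, then weight every
    # conditional prediction by the inferred attribute probability.
    class_probs = [0.0] * len(classes)
    for a, p_a in zip(attributes, attr_probs):
        conditional = softmax([sim[(c, a)] for c in classes])
        for i, p_c in enumerate(conditional):
            class_probs[i] += p_a * p_c

    best = max(range(len(classes)), key=lambda i: class_probs[i])
    return (
        classes[best],
        dict(zip(attributes, attr_probs)),
        dict(zip(classes, class_probs)),
    )


# Toy scores (made up for illustration): a waterbird photographed on land.
classes = ["landbird", "waterbird"]
attributes = ["on land", "on water"]
sim = {
    ("landbird", "on land"): 1.0,
    ("waterbird", "on land"): 2.0,
    ("landbird", "on water"): 0.5,
    ("waterbird", "on water"): 0.2,
}

label, attr_probs, class_probs = two_step_classify(sim, classes, attributes)
print(label, attr_probs, class_probs)
```

With these toy scores, the background is inferred as most likely "on land" while the predicted class is "waterbird", the kind of case where the background and the object disagree. Weighting by the full attribute distribution rather than committing to the single most likely attribute is a design choice made for this sketch.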
