More Context, Less Distraction: Visual Classification by Inferring and Conditioning on Contextual Attributes

Abstract

CLIP, a foundational vision-language model, is widely used for zero-shot image classification because it can understand a wide range of visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better zero-shot classification remains an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that, when classifying an object, humans first infer its class-independent attributes (e.g., background and orientation), which help separate the foreground object from the background, and then make decisions based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer these attributes from an image. Building on these observations, we propose PerceptionCLIP, a training-free, two-step zero-shot classification method. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioned on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. For example, PerceptionCLIP with ViT-L/14 improves worst-group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
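
The two-step procedure described in the abstract (infer a contextual attribute, then classify conditioned on it) can be illustrated with a minimal sketch. This is not the authors' implementation: the embeddings are hand-built toy vectors standing in for CLIP image and text features, and the function names (`two_step_classify`, `toy_embeddings`), the prompt layout, the temperature value, and the log-sum-exp aggregation over classes in step 1 are all assumptions made for illustration.

```python
import numpy as np


def log_sum_exp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def softmax(x):
    """Softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def two_step_classify(image_emb, text_embs, temperature=0.1):
    """Hypothetical two-step zero-shot classifier.

    image_emb: (d,) unit vector for the image.
    text_embs: (num_classes, num_attrs, d) unit vectors, one per prompt such as
               "a photo of a {class} {attribute}" (assumed prompt layout).
    Returns (inferred_attribute_index, predicted_class_index, class_probs).
    """
    # Scaled cosine similarity between the image and every (class, attribute) prompt.
    sims = (text_embs @ image_emb) / temperature  # shape: (num_classes, num_attrs)

    # Step 1: infer the contextual attribute, aggregating evidence over classes
    # (assumption: log-sum-exp aggregation, so no single class has to be chosen first).
    attr_probs = softmax(log_sum_exp(sims, axis=0))
    attr = int(np.argmax(attr_probs))

    # Step 2: classify the object using only prompts with the inferred attribute.
    class_probs = softmax(sims[:, attr])
    return attr, int(np.argmax(class_probs)), class_probs


def toy_embeddings():
    """Toy stand-ins for CLIP features (assumption, not real model output).

    Classes: 0 = landbird, 1 = waterbird. Attributes: 0 = on land, 1 = on water.
    The image depicts a waterbird on a land background.
    """
    basis = np.eye(4)
    class_vecs, attr_vecs = basis[:2], basis[2:]
    text = class_vecs[:, None, :] + attr_vecs[None, :, :]  # (2, 2, 4)
    text = text / np.linalg.norm(text, axis=-1, keepdims=True)
    image = class_vecs[1] + attr_vecs[0]
    image = image / np.linalg.norm(image)
    return image, text


image, text = toy_embeddings()
attr, cls, probs = two_step_classify(image, text)
print("inferred attribute:", attr, "predicted class:", cls)
```

With real CLIP features, `text_embs` would come from encoding one prompt per (class, attribute) pair and `image_emb` from the image encoder; that wiring is omitted here because it depends on the specific CLIP library in use.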
