More Context, Less Distraction: Visual Classification by Inferring and Conditioning on Contextual Attributes

August 2, 2023
Authors: Bang An, Sicheng Zhu, Michael-Andrei Panaitescu-Liess, Chaithanya Kumar Mummadi, Furong Huang
cs.AI

Abstract

CLIP, as a foundational vision language model, is widely used in zero-shot image classification due to its ability to understand various visual concepts and natural language descriptions. However, how to fully leverage CLIP's unprecedented human-like understanding capabilities to achieve better zero-shot classification is still an open question. This paper draws inspiration from the human visual perception process: a modern neuroscience view suggests that in classifying an object, humans first infer its class-independent attributes (e.g., background and orientation), which help separate the foreground object from the background, and then make decisions based on this information. Inspired by this, we observe that providing CLIP with contextual attributes improves zero-shot classification and mitigates reliance on spurious features. We also observe that CLIP itself can reasonably infer these attributes from an image. With these observations, we propose a training-free, two-step zero-shot classification method named PerceptionCLIP. Given an image, it first infers contextual attributes (e.g., background) and then performs object classification conditioning on them. Our experiments show that PerceptionCLIP achieves better generalization, group robustness, and interpretability. For example, PerceptionCLIP with ViT-L/14 improves the worst group accuracy by 16.5% on the Waterbirds dataset and by 3.5% on CelebA.
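The sketch below illustrates the two-step idea described in the abstract, not the authors' implementation: it uses a Hugging Face CLIP checkpoint (openai/clip-vit-large-patch14), and the class names, background attribute values, prompt templates, and image path are illustrative assumptions chosen to mirror a Waterbirds-style setup.

```python
# Minimal sketch of two-step zero-shot classification with CLIP:
# (1) infer a contextual attribute (background), (2) classify conditioned on it.
# Class names, attributes, prompts, and the image path are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

classes = ["landbird", "waterbird"]        # hypothetical class names
backgrounds = ["on land", "over water"]    # hypothetical contextual attribute values

image = Image.open("bird.jpg")             # placeholder image path

def clip_probs(image, texts):
    """Return CLIP's softmax probabilities of each text matching the image."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape: (1, len(texts))
    return logits.softmax(dim=-1).squeeze(0)

# Step 1: infer the contextual attribute (background) from the image.
bg_probs = clip_probs(image, [f"a photo of a bird {bg}" for bg in backgrounds])
inferred_bg = backgrounds[bg_probs.argmax().item()]

# Step 2: classify the object, conditioning the prompts on the inferred attribute.
class_probs = clip_probs(image, [f"a photo of a {c} {inferred_bg}" for c in classes])
prediction = classes[class_probs.argmax().item()]
print(f"inferred background: {inferred_bg}, prediction: {prediction}")
```

Conditioning the class prompts on the inferred background is what, per the abstract, helps separate the foreground object from the background and reduces reliance on spurious background features.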