Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization
January 29, 2024
Authors: Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang
cs.AI
Abstract
Existing vision-language models exhibit strong generalization on a variety of
visual domains and tasks. However, such models mainly perform zero-shot
recognition in a closed-set manner, and thus struggle to handle open-domain
visual concepts by design. There are recent finetuning methods, such as prompt
learning, that not only study the discrimination between in-distribution (ID)
and out-of-distribution (OOD) samples, but also show some improvements in both
ID and OOD accuracies. In this paper, we first demonstrate that vision-language
models, after long enough finetuning but without proper regularization, tend to
overfit the known classes in the given dataset, with degraded performance on
unknown classes. Then we propose a novel approach OGEN to address this pitfall,
with the main focus on improving the OOD GENeralization of finetuned models.
Specifically, a class-conditional feature generator is introduced to synthesize
OOD features using just the class name of any unknown class. Such synthesized
features will provide useful knowledge about unknowns and help regularize the
decision boundary between ID and OOD data when optimized jointly. Equally
important is our adaptive self-distillation mechanism to regularize our feature
generation model during joint optimization, i.e., adaptively transferring
knowledge between model states to further prevent overfitting. Experiments
validate that our method yields convincing gains in OOD generalization
performance in different settings.
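
The abstract describes a class-conditional feature generator that synthesizes OOD features from nothing but an unknown class name's text embedding, and uses those features to regularize the ID/OOD decision boundary during joint optimization. The sketch below is a hypothetical PyTorch illustration of that idea, not the authors' released code: the MLP generator, the noise conditioning, and the joint cross-entropy regularizer are all assumptions about one plausible realization.

```python
# Minimal sketch (not the authors' code): a class-conditional feature
# generator that maps the text embedding of an unknown class name to a
# synthetic image-space feature. Architecture, dimensions, and the loss
# are illustrative assumptions, not the OGEN specification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassConditionalFeatureGenerator(nn.Module):
    def __init__(self, text_dim: int = 512, feat_dim: int = 512, hidden: int = 1024):
        super().__init__()
        # Small MLP: text embedding (+ noise for diversity) -> image-space feature.
        self.net = nn.Sequential(
            nn.Linear(text_dim * 2, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate Gaussian noise so repeated calls yield varied OOD features.
        noise = torch.randn_like(text_emb)
        feat = self.net(torch.cat([text_emb, noise], dim=-1))
        return F.normalize(feat, dim=-1)  # match CLIP's unit-norm features

def ood_regularization_loss(image_feats, known_text_embs, unknown_text_embs,
                            generator, labels, temperature=0.01):
    """Joint objective sketch: classify real ID features against known classes
    while synthesized OOD features populate the unknown side of the boundary."""
    fake_ood = generator(unknown_text_embs)                      # (U, D)
    all_feats = torch.cat([image_feats, fake_ood], dim=0)        # ID + synthetic OOD
    all_text = torch.cat([known_text_embs, unknown_text_embs])   # (K + U, D)
    logits = all_feats @ all_text.t() / temperature
    # Each synthetic feature is labeled as its source unknown class.
    ood_labels = torch.arange(len(unknown_text_embs)) + len(known_text_embs)
    targets = torch.cat([labels, ood_labels.to(labels.device)])
    return F.cross_entropy(logits, targets)
```

In this reading, the synthetic features act as stand-ins for classes the finetuned model never sees, so the joint objective pushes the decision boundary away from the ID clusters rather than letting finetuning collapse onto the known classes.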
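The abstract also mentions an adaptive self-distillation mechanism that transfers knowledge between model states during joint optimization to curb overfitting. Below is a hedged sketch of one such scheme, in which an exponential moving average of earlier checkpoints serves as the teacher and the distillation weight adapts to teacher-student disagreement; the paper's actual state-selection and weighting rules may differ.

```python
# Minimal sketch (assumptions, not the released implementation): self-
# distillation between model states. The teacher here is an EMA of past
# student checkpoints, and the loss weight scales with how far the student
# has drifted from it. Both choices are illustrative.
import copy
import torch
import torch.nn.functional as F

class SelfDistiller:
    def __init__(self, model, ema_decay: float = 0.999):
        self.ema_decay = ema_decay
        # Teacher = frozen moving average of past student states.
        self.teacher = copy.deepcopy(model)
        for p in self.teacher.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update_teacher(self, model):
        # Blend current student weights into the teacher after each step.
        for t, s in zip(self.teacher.parameters(), model.parameters()):
            t.mul_(self.ema_decay).add_(s, alpha=1.0 - self.ema_decay)

    def distill_loss(self, student_logits, teacher_logits, tau: float = 2.0):
        # KL between softened teacher and student distributions; the weight
        # adapts to their current disagreement (one possible "adaptive" rule).
        p_teacher = F.softmax(teacher_logits / tau, dim=-1)
        log_p_student = F.log_softmax(student_logits / tau, dim=-1)
        kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau * tau
        weight = kl.detach().clamp(max=1.0)  # lean on the teacher more as drift grows
        return weight * kl
```

The design intuition, as stated in the abstract, is that an earlier model state retains generalization that prolonged finetuning erodes, so distilling from it regularizes the feature generation model without freezing training outright.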