Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

January 29, 2024
Authors: Yuhang Zang, Hanlin Goh, Josh Susskind, Chen Huang
cs.AI

Abstract

Existing vision-language models exhibit strong generalization across a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner and thus struggle by design to handle open-domain visual concepts. Recent finetuning methods, such as prompt learning, not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples but also show some improvement in both ID and OOD accuracy. In this paper, we first demonstrate that vision-language models, after sufficiently long finetuning without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. We then propose a novel approach, OGEN, to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. Such synthesized features provide useful knowledge about unknowns and help regularize the decision boundary between ID and OOD data when optimized jointly. Equally important is our adaptive self-distillation mechanism, which regularizes the feature generation model during joint optimization by adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings.
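To make the two components concrete, below is a minimal PyTorch sketch of the abstract's ideas: a class-conditional generator that synthesizes OOD features from unknown-class text embeddings, a regularizer that uses those features to tighten the ID/OOD decision boundary, and a self-distillation term against an earlier model state. This is not the authors' implementation; the network sizes, the uniform-distribution OOD loss, the fixed `distill_weight` (the paper's distillation weight is adaptive), and the EMA teacher update are all illustrative assumptions.

```python
# Minimal sketch (not the OGEN authors' code) of class-conditional OOD
# feature generation plus self-distillation, in plain PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassConditionalGenerator(nn.Module):
    """Synthesizes an image-like feature for an unknown class, conditioned
    only on that class's text embedding (e.g., a CLIP text-encoder output).
    Dimensions are illustrative assumptions."""

    def __init__(self, text_dim=512, noise_dim=64, feat_dim=512):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, text_emb, z=None):  # text_emb: (B, text_dim)
        if z is None:
            z = torch.randn(text_emb.size(0), self.noise_dim,
                            device=text_emb.device)
        feat = self.net(torch.cat([text_emb, z], dim=-1))
        return F.normalize(feat, dim=-1)


def ogen_style_losses(id_feats, id_labels, known_protos, unknown_text_emb,
                      generator, teacher, distill_weight=0.5, tau=0.07):
    """ID classification + OOD boundary regularization + self-distillation.

    id_feats:         (B, D) finetuned image features of ID samples
    known_protos:     (K, D) known-class prototypes (e.g., text embeddings)
    unknown_text_emb: (M, text_dim) text embeddings of unknown class names
    teacher:          a frozen/EMA copy of `generator` (earlier model state)
    """
    # (1) Standard objective on known (ID) classes.
    id_logits = id_feats @ known_protos.t() / tau
    loss_id = F.cross_entropy(id_logits, id_labels)

    # (2) Synthetic OOD features should not be confidently assigned to any
    # known class: push their predictions toward uniform, which regularizes
    # the ID/OOD decision boundary during joint optimization.
    z = torch.randn(unknown_text_emb.size(0), generator.noise_dim,
                    device=unknown_text_emb.device)
    ood_feats = generator(unknown_text_emb, z)
    ood_logits = ood_feats @ known_protos.t() / tau
    uniform = torch.full_like(ood_logits, 1.0 / known_protos.size(0))
    loss_ood = F.kl_div(F.log_softmax(ood_logits, dim=-1), uniform,
                        reduction="batchmean")

    # (3) Self-distillation: keep the current generator close to an earlier
    # model state to damp overfitting as finetuning proceeds.
    with torch.no_grad():
        teacher_feats = teacher(unknown_text_emb, z)  # same noise as student
    loss_distill = F.mse_loss(ood_feats, teacher_feats)

    return loss_id + loss_ood + distill_weight * loss_distill


@torch.no_grad()
def update_teacher(teacher, generator, momentum=0.999):
    """EMA update: one simple way to maintain the 'earlier model state'."""
    for pt, ps in zip(teacher.parameters(), generator.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)
```

In this sketch the teacher would be initialized as a deep copy of the generator and refreshed with `update_teacher` after each optimizer step; sharing the noise vector `z` between teacher and student keeps the distillation target well-defined, since otherwise the two forwards would sample different latent codes.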