Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization

Abstract

Existing vision-language models exhibit strong generalization on a variety of visual domains and tasks. However, such models mainly perform zero-shot recognition in a closed-set manner, and thus struggle to handle open-domain visual concepts by design. Recent finetuning methods, such as prompt learning, not only study the discrimination between in-distribution (ID) and out-of-distribution (OOD) samples, but also show some improvements in both ID and OOD accuracies. In this paper, we first demonstrate that vision-language models, after long enough finetuning but without proper regularization, tend to overfit the known classes in the given dataset, with degraded performance on unknown classes. We then propose a novel approach, OGEN, to address this pitfall, with the main focus on improving the OOD GENeralization of finetuned models. Specifically, a class-conditional feature generator is introduced to synthesize OOD features using just the class name of any unknown class. When optimized jointly, such synthesized features provide useful knowledge about unknown classes and help regularize the decision boundary between ID and OOD data. Equally important is our adaptive self-distillation mechanism, which regularizes the feature generation model during joint optimization by adaptively transferring knowledge between model states to further prevent overfitting. Experiments validate that our method yields convincing gains in OOD generalization performance in different settings.
