Singular Value Few-shot Adaptation of Vision-Language Models

Abstract

Vision-language models (VLMs) such as CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult because of their reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design improves adaptation performance using only 0.04% of the model's total parameters and better preserves the model's generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we use a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation, providing interpretability for CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
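The core idea of fine-tuning only singular values can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a hypothetical 2x2 weight matrix built from fixed orthonormal bases U and V, and the "adapted" singular values are made-up numbers standing in for the result of training. It shows that changing only the singular values rescales the contribution of each basis direction while the bases themselves stay untouched. The helper names (`compose`, `trainable_fraction`) and the 512x512 shape are assumptions for illustration, and the single-matrix fraction is not meant to reproduce the 0.04% figure reported for the whole model.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def rotation(theta):
    """2x2 rotation matrix; its columns form an orthonormal basis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def compose(u, singular_values, v):
    """Rebuild W = U diag(s) V^T from fixed bases and a list of singular values."""
    n = len(singular_values)
    diag = [[singular_values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(matmul(u, diag), transpose(v))

def trainable_fraction(d_out, d_in):
    """Share of one d_out x d_in matrix's entries that its singular values account for."""
    return min(d_out, d_in) / (d_out * d_in)

# Frozen bases: stand-ins for the singular vectors of a pretrained weight matrix.
U = rotation(0.3)
V = rotation(1.1)

pretrained_s = [3.0, 1.0]  # hypothetical pretrained singular values
adapted_s = [3.3, 0.8]     # hypothetical values after updating only the singular values

W_pre = compose(U, pretrained_s, V)
W_new = compose(U, adapted_s, V)

# Feeding the first right singular vector through the adapted matrix returns the
# first left singular vector scaled by the adapted singular value.
v1 = [[V[0][0]], [V[1][0]]]
u1 = [U[0][0], U[1][0]]
out = [row[0] for row in matmul(W_new, v1)]
expected = [adapted_s[0] * x for x in u1]

print("output:  ", out)
print("expected:", expected)
print("trainable fraction for a 512x512 matrix:", trainable_fraction(512, 512))
```

In a real setting the bases would come from decomposing each pretrained weight matrix once, and only the vector of singular values would receive gradient updates.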
