Singular Value Few-shot Adaptation of Vision-Language Models

Abstract

Vision-language models (VLMs) such as CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult because of the reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design improves adaptation performance using only 0.04% of the model's total parameters and better preserves the model's generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we use a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation, which supports the interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
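
The core idea of tuning only the singular values can be illustrated with a small sketch. This is not the authors' implementation (that lives in the repository linked above); it assumes NumPy, uses a random matrix as a stand-in for one pretrained weight matrix, and invents a toy objective, a learning rate, and the helper name `adapted_weight` purely for illustration. The singular vectors stay frozen and only the vector of singular values is updated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one pretrained weight matrix (hypothetical size, not CLIP's).
W = rng.standard_normal((64, 32))

# Factor once: W = U @ diag(s) @ Vt. U and Vt are frozen from here on.
U, s, Vt = np.linalg.svd(W, full_matrices=False)


def adapted_weight(s_trainable):
    """Rebuild the weight from the frozen bases and trainable singular values."""
    return (U * s_trainable) @ Vt  # scales column k of U by s_trainable[k]


# Before any update, the reconstruction matches the pretrained matrix.
assert np.allclose(adapted_weight(s), W)

# Toy objective (an assumption for illustration only): match a target
# whose singular values are rescaled versions of the originals.
scale = rng.uniform(0.5, 1.5, size=s.shape)
target = adapted_weight(s * scale)

s_train = s.copy()
lr = 0.1  # illustrative learning rate
for _ in range(200):
    residual = adapted_weight(s_train) - target
    # Gradient of 0.5 * ||U diag(s) Vt - target||_F^2 with respect to s
    # is the diagonal of U^T @ residual @ V.
    grad = np.einsum("ij,ij->j", U, residual @ Vt.T)
    s_train -= lr * grad

# Only s is trainable; the bases U and Vt contribute no trainable parameters.
trainable_fraction = s.size / W.size
```

In this toy setup the trainable vector has 32 entries against 2,048 entries in the full matrix. That ratio applies only to this example and is not comparable to the 0.04% figure reported in the abstract, which is measured over the whole model.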
