Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is hampered by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) the Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
