
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

April 16, 2026
作者: Haoyi Sun, Xiaoxiao Wang, Ning Mao, Qian Wang, Lifu Mu, Wen Zheng, Tao Wei, Wei Chen
cs.AI

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve a compact model's capabilities without increasing its size or data requirements, making deployment more efficient. However, applying KD to VLMs is complicated by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately, without explicitly addressing multimodal alignment, which leads to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) a Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of both teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
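
The abstract only names the two components, so the PyTorch sketch below shows one plausible shape they could take. Everything in it is an assumption for illustration, not the authors' implementation: the `visual_switch_reference` and `dbild_loss` names, the callable-teacher interface, the fixed top-k selection standing in for the paper's dynamic region selection, and the pairwise-difference formulation of the loss.

```python
# Illustrative sketch only: module names, the fixed top-k selection, the
# pairwise-difference form, and the temperature are assumptions, not the
# paper's exact formulation.
import torch
import torch.nn.functional as F


def visual_switch_reference(student_visual_tokens, teacher_lm):
    """Visual-Switch Distillation (sketch): route the *student's* visual
    outputs through the *teacher's* language pathway to obtain cross-modal
    reference logits. `teacher_lm` is any callable mapping token embeddings
    to vocabulary logits (hypothetical interface)."""
    with torch.no_grad():  # the teacher provides targets, not gradients
        return teacher_lm(student_visual_tokens)


def pairwise_diffs(z):
    """All pairwise logit differences z_i - z_j within a selected slice,
    flattened to one vector per position; matching these aligns ranking
    structure rather than absolute logit values."""
    return (z.unsqueeze(-1) - z.unsqueeze(-2)).flatten(-2)


def dbild_loss(student_logits, teacher_logits, k=8, tau=2.0):
    """DBiLD loss (sketch): bidirectional KL over pairwise logit differences,
    each direction restricted to one model's top-k ("informative") region.
    The paper's dynamic region selection is approximated by a fixed top-k."""
    # Teacher-led direction: the student matches the teacher's ranking
    # structure on the teacher's most confident entries.
    _, t_idx = teacher_logits.topk(k, dim=-1)
    d_t = pairwise_diffs(teacher_logits.gather(-1, t_idx)) / tau
    d_s = pairwise_diffs(student_logits.gather(-1, t_idx)) / tau
    fwd = F.kl_div(F.log_softmax(d_s, -1), F.softmax(d_t, -1),
                   reduction="batchmean")

    # Student-led direction: the teacher's view of the student's top-k
    # region supervises the student's own distributional structure.
    _, s_idx = student_logits.topk(k, dim=-1)
    d_t2 = pairwise_diffs(teacher_logits.gather(-1, s_idx)) / tau
    d_s2 = pairwise_diffs(student_logits.gather(-1, s_idx)) / tau
    rev = F.kl_div(F.log_softmax(d_s2, -1), F.softmax(d_t2, -1),
                   reduction="batchmean")
    return fwd + rev


if __name__ == "__main__":
    torch.manual_seed(0)
    batch, vocab = 4, 32000
    teacher_logits = torch.randn(batch, vocab)
    student_logits = torch.randn(batch, vocab, requires_grad=True)
    loss = dbild_loss(student_logits, teacher_logits)
    loss.backward()  # gradients flow only into the student
    print(f"DBiLD loss: {loss.item():.4f}")
```

Matching pairwise logit differences rather than raw logits is one way to preserve each model's internal ranking structure while leaving absolute scales free, which fits the abstract's claim of aligning informative regions without distorting either distribution; the paper's actual dynamic selection criterion and loss weighting may differ.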