
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

April 16, 2026
Authors: Haoyi Sun, Xiaoxiao Wang, Ning Mao, Qian Wang, Lifu Mu, Wen Zheng, Tao Wei, Wei Chen
cs.AI

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale poses significant challenges for deployment in resource-constrained scenarios. Knowledge Distillation (KD) offers a viable way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is challenged by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, leading to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
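The visual-switch idea above can be illustrated with a toy sketch. To keep it self-contained, two random linear heads (`teacher_head`, `student_head`) stand in for the teacher's and student's full language pathways, and the forward-KL objective is an assumption; the paper routes the student's visual outputs through the actual teacher LLM and may use a different divergence.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Toy stand-ins (assumptions): random linear heads replace the full language
# pathways; D is the visual feature width, V the vocabulary size.
D, V = 8, 16
teacher_head = rng.normal(size=(D, V))
student_head = rng.normal(size=(D, V))

# A projected visual token produced by the student's vision branch.
student_visual = rng.normal(size=(D,))

# Visual switch: route the student's visual output through the TEACHER's
# language pathway to obtain a cross-modal probabilistic reference.
reference = softmax(student_visual @ teacher_head)

# The student's own prediction in the shared text-probability space.
student_pred = softmax(student_visual @ student_head)

# Implicit visual knowledge transfer: pull the student's distribution toward
# the teacher-conditioned reference (forward KL shown here as one plausible
# choice; the paper's exact objective may differ).
switch_loss = float(
    np.sum(reference * (np.log(reference) - np.log(student_pred)))
)
```

Because both distributions live in the same text-probability space, the teacher supervises the student's visual branch without any architectural change on the student side, matching the abstract's claim.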
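The DBiLD loss is described only at a high level in the abstract. The sketch below shows one plausible reading: pairwise logit differences over each model's top-k region are aligned with KL terms in both directions (teacher-led and student-led). The top-k selection, the pairwise-difference construction, and the fixed mixing weight `beta` are assumptions; in particular, the paper's "dynamic" adaptive weighting is not reproduced here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors, with a small epsilon guard."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def bild_style_loss(student_logits, teacher_logits, k=4, beta=0.5):
    """Hedged sketch of a bi-directional logits-difference loss.

    Aligns the pairwise differences among each model's top-k logits, with
    one KL term in each direction. All structural choices here are
    illustrative assumptions, not the paper's exact DBiLD formulation.
    """
    t_idx = np.argsort(teacher_logits)[-k:]  # teacher-selected region
    s_idx = np.argsort(student_logits)[-k:]  # student-selected region

    def pairwise_diff(logits, idx):
        sel = logits[idx]
        return (sel[:, None] - sel[None, :]).ravel()

    # Forward term: match the student to the teacher on the teacher's region.
    p_t = softmax(pairwise_diff(teacher_logits, t_idx))
    p_s = softmax(pairwise_diff(student_logits, t_idx))
    fwd = kl(p_t, p_s)

    # Reverse term: match the teacher's shape on the student's region, which
    # discourages the student from distorting its own distributional structure.
    q_s = softmax(pairwise_diff(student_logits, s_idx))
    q_t = softmax(pairwise_diff(teacher_logits, s_idx))
    rev = kl(q_s, q_t)

    return beta * fwd + (1.0 - beta) * rev
```

Working on logit differences rather than raw probabilities focuses the supervision on the informative high-probability region while remaining invariant to a constant shift of the logits, which is one way to read the abstract's "aligns informative probability regions while preserving distributional structures".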