Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization
May 12, 2025
Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang
cs.AI
Abstract
Vision-language models (VLMs) have achieved remarkable success across diverse
tasks by leveraging rich textual information with minimal labeled data.
However, deploying such large models remains challenging, particularly in
resource-constrained environments. Knowledge distillation (KD) offers a
well-established solution to this problem; however, recent KD approaches from
VLMs often involve multi-stage training or additional tuning, increasing
computational overhead and optimization complexity. In this paper, we propose
Dual-Head Optimization (DHO) -- a simple yet
effective KD framework that transfers knowledge from VLMs to compact,
task-specific models in semi-supervised settings. Specifically, we introduce
dual prediction heads that independently learn from labeled data and teacher
predictions, and propose to linearly combine their outputs during inference. We
observe that DHO mitigates gradient conflicts between supervised and
distillation signals, enabling more effective feature learning than single-head
KD baselines. As a result, extensive experiments show that DHO
consistently outperforms baselines across multiple domains and fine-grained
datasets. Notably, on ImageNet, it achieves state-of-the-art performance,
improving accuracy by 3% and 0.1% with 1% and 10% labeled data, respectively,
while using fewer parameters.
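For concreteness, the dual-head design described above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed details: the abstract does not specify whether the heads are linear, whether outputs are combined as logits or probabilities, or the exact loss forms, so the cross-entropy/KL losses, the mixing weight `alpha`, and the temperature `T` below are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DHOStudent(nn.Module):
    """Student with a shared backbone and two prediction heads:
    one supervised by labels, one distilled from the VLM teacher."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.ce_head = nn.Linear(feat_dim, num_classes)  # learns from labeled data
        self.kd_head = nn.Linear(feat_dim, num_classes)  # learns from teacher predictions

    def forward(self, x):
        feats = self.backbone(x)
        return self.ce_head(feats), self.kd_head(feats)


def dho_loss(ce_logits, kd_logits, labels, teacher_logits, labeled_mask, T=2.0):
    """Supervised CE on the labeled subset plus KL distillation on all
    examples; the two signals flow through separate heads, which is what
    mitigates gradient conflicts. Loss forms and T are assumed choices."""
    ce = (
        F.cross_entropy(ce_logits[labeled_mask], labels[labeled_mask])
        if labeled_mask.any()
        else 0.0
    )
    kd = F.kl_div(
        F.log_softmax(kd_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return ce + kd


@torch.no_grad()
def dho_predict(model, x, alpha=0.5):
    """Inference: linearly combine the two heads' outputs (here, as
    probabilities; combining logits is an equally plausible reading)."""
    ce_logits, kd_logits = model(x)
    probs = alpha * ce_logits.softmax(-1) + (1 - alpha) * kd_logits.softmax(-1)
    return probs.argmax(-1)
```

In this sketch only the distillation head receives gradients from the teacher and only the supervised head receives gradients from labels, so the shared backbone sees both signals without forcing a single head to reconcile them.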