

Training a Student Expert via Semi-Supervised Foundation Model Distillation

April 4, 2026
Authors: Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu
cs.AI

Abstract
Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation, where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our approximately 11× smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on these benchmarks.
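The abstract does not give the exact formulation of the instance-aware pixel-wise contrastive loss, but its described ingredients — per-pixel embeddings grouped by instance, with negatives weighted by fused mask-and-class confidence — can be sketched as an InfoNCE-style objective. The function below is a minimal illustrative sketch, not the paper's implementation; the confidence-weighting scheme and all names (`instance_contrastive_loss`, `tau`) are assumptions.

```python
import numpy as np

def instance_contrastive_loss(embeddings, inst_ids, scores, tau=0.1):
    """Sketch of an instance-aware pixel-wise contrastive loss.

    embeddings: (N, D) pixel embeddings
    inst_ids:   (N,) instance id of each pixel
    scores:     (N,) fused mask*class confidence per pixel; used here
                (assumption) to up-weight confident, informative negatives
    """
    # L2-normalize so dot products are cosine similarities
    emb = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-8)
    sim = emb @ emb.T / tau                      # (N, N) similarity logits

    same = inst_ids[:, None] == inst_ids[None, :]
    pos_mask = same.copy()
    np.fill_diagonal(pos_mask, False)            # positives: same instance, not self
    neg_w = (~same) * scores[None, :]            # negatives weighted by confidence

    losses = []
    for i in range(len(emb)):
        if not pos_mask[i].any():
            continue                             # pixel has no positive partner
        exp_sim = np.exp(sim[i])
        denom = exp_sim[pos_mask[i]].sum() + (exp_sim * neg_w[i]).sum()
        # average InfoNCE term over all positives of pixel i
        losses.append(-np.log(exp_sim[pos_mask[i]] / denom).mean())
    return float(np.mean(losses))
```

With embeddings of the two instances well separated, the loss is near zero; when pixels of different instances are entangled, it grows, which is the margin-enforcing behavior the abstract describes.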