Training a Student Expert via Semi-Supervised Foundation Model Distillation
April 4, 2026
Authors: Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu
cs.AI
Abstract
Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation, where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our ≈11× smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on both benchmarks.
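To make the central idea concrete, the instance-aware pixel-wise contrastive loss can be sketched as an InfoNCE-style objective in which negatives are weighted by a fused mask × class confidence score. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the exact fusion rule, temperature, and sampling strategy are placeholders.

```python
import numpy as np

def instance_contrastive_loss(embed, inst_ids, mask_scores, cls_scores, tau=0.1):
    """Sketch of an instance-aware pixel-wise contrastive loss.

    embed       : (N, D) per-pixel embeddings (e.g., sampled from a feature map)
    inst_ids    : (N,)   instance id of each pixel; same id = positive pair
    mask_scores : (N,)   mask confidence per pixel (assumed available)
    cls_scores  : (N,)   classification confidence per pixel (assumed available)
    tau         : temperature for the similarity logits
    """
    # L2-normalize embeddings so dot products are cosine similarities
    embed = embed / (np.linalg.norm(embed, axis=1, keepdims=True) + 1e-8)
    sim = embed @ embed.T / tau                       # (N, N) similarity logits

    # Positive mask: pixels belonging to the same instance
    pos = (inst_ids[:, None] == inst_ids[None, :]).astype(float)

    # Fuse mask and class scores into a per-pixel confidence, and use it to
    # up-weight confident negatives (the "informative negatives" of the paper)
    conf = mask_scores * cls_scores                   # assumed fusion: product
    neg_w = conf[None, :] * (1.0 - pos)               # weights on cross-instance pairs

    exp_sim = np.exp(sim - sim.max())                 # shift for numerical stability
    pos_term = (exp_sim * pos).sum(axis=1)
    neg_term = (exp_sim * neg_w).sum(axis=1)

    # Anchor-wise InfoNCE: pull same-instance pixels together,
    # push confidence-weighted cross-instance pixels apart
    loss = -np.log(pos_term / (pos_term + neg_term + 1e-8))
    return loss.mean()

# Toy usage: 8 pixels, 4 instances of 2 pixels each
rng = np.random.default_rng(0)
loss = instance_contrastive_loss(
    embed=rng.standard_normal((8, 4)),
    inst_ids=np.array([0, 0, 1, 1, 2, 2, 3, 3]),
    mask_scores=rng.uniform(0.5, 1.0, 8),
    cls_scores=rng.uniform(0.5, 1.0, 8),
)
```

Because the negative weights scale with the fused confidence, low-quality pseudo-labeled pixels contribute little gradient, which is consistent with the paper's goal of mitigating pseudo-label bias while sharpening inter-instance margins.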