Training a Student Expert via Semi-Supervised Foundation Model Distillation

Abstract

Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation, where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and leverage unlabeled images more effectively. On Cityscapes and ADE20K, our approximately 11× smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses the adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.
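To make the idea of an instance-aware pixel-wise contrastive loss more concrete, the following is a minimal sketch, not the formulation used in the paper. Everything in it is an assumption made for illustration: the function name `instance_contrastive_loss`, the fusion rule (product of mask and class score), the confidence threshold `min_conf`, the choice of the `top_k` most similar pixels from other instances as "informative negatives", and the InfoNCE-style objective with a `temperature`. It uses plain Python lists for readability; a practical version would operate on tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def instance_contrastive_loss(embeddings, instance_ids, mask_scores, class_scores,
                              temperature=0.1, min_conf=0.5, top_k=2):
    """Hypothetical InfoNCE-style loss over pixels grouped by (pseudo-)instance.

    embeddings   : one embedding vector per pixel
    instance_ids : instance assignment per pixel (e.g. from pseudo-labels)
    mask_scores  : per-pixel mask confidence in [0, 1]
    class_scores : per-pixel class confidence in [0, 1]
    """
    # Assumed fusion rule: multiply mask and class confidence per pixel.
    fused = [m * c for m, c in zip(mask_scores, class_scores)]
    # Only pixels with a sufficiently confident fused score take part.
    keep = [i for i, f in enumerate(fused) if f >= min_conf]

    losses = []
    for i in keep:
        positives = [j for j in keep
                     if j != i and instance_ids[j] == instance_ids[i]]
        negatives = [j for j in keep if instance_ids[j] != instance_ids[i]]
        if not positives or not negatives:
            continue
        # Assumed notion of "informative" negatives: the top_k pixels from
        # other instances that are most similar to the anchor.
        neg_sims = sorted((cosine(embeddings[i], embeddings[j]) for j in negatives),
                          reverse=True)[:top_k]
        neg_term = sum(math.exp(s / temperature) for s in neg_sims)
        for j in positives:
            pos = math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            losses.append(-math.log(pos / (pos + neg_term)))
    # Mean over all anchor/positive pairs; zero if no valid pair exists.
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: four pixels, two instances, full confidence everywhere.
ids = [0, 0, 1, 1]
conf = [1.0, 1.0, 1.0, 1.0]
well_separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
entangled = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
print(instance_contrastive_loss(well_separated, ids, conf, conf))
print(instance_contrastive_loss(entangled, ids, conf, conf))
```

In this toy setup the loss is small when pixels of the same instance share similar embeddings and large when embeddings of different instances overlap, which is the behavior a margin between instances is meant to encourage. Applying the same function to teacher and student embeddings would be one way to keep the contrastive signal in both the adaptation and the distillation stage, again as an assumption rather than the paper's exact procedure.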
