

PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models

June 1, 2025
Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Dongseop Kim, Sung Ju Hwang
cs.AI

Abstract

Knowledge distillation (KD) is a widely used framework for training compact, task-specific models by leveraging the knowledge of teacher models. However, its application to active learning (AL), which aims to minimize annotation costs through iterative sample selection, remains underexplored. This gap stems from the fact that KD typically assumes access to sufficient labeled data, whereas AL operates in data-scarce scenarios where task-specific teacher models are often unavailable. In this paper, we introduce ActiveKD, a framework that integrates AL with KD by leveraging the zero- and few-shot capabilities of large vision-language models (VLMs). A key aspect of ActiveKD is the structured prediction bias of VLMs -- i.e., their predictions form clusters in the probability space. We regard this structure as an inductive bias of the teacher model, capturing generalizable output patterns beneficial to student learning. To exploit this bias, we propose Probabilistic CoreSet (PCoreSet), a selection strategy that maximizes coverage in the probability space rather than the feature space. PCoreSet strategically selects categorically diverse unlabeled samples, facilitating more efficient transfer of teacher knowledge under limited annotation budgets. Evaluations on 11 datasets show that PCoreSet consistently outperforms existing selection methods within the ActiveKD framework, advancing research at the intersection of AL and KD.
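The abstract describes PCoreSet as a coverage-maximizing selection strategy in probability space. A common way to realize such coverage is greedy k-center selection; the sketch below applies it to teacher (VLM) softmax outputs rather than feature embeddings. This is a minimal illustration under that assumption, not the paper's exact procedure, and the function name and Euclidean distance choice are illustrative.

```python
import numpy as np

def pcoreset_select(probs_unlabeled, probs_labeled, budget):
    """Greedy k-center selection over teacher probability vectors.

    probs_unlabeled: (N, C) softmax outputs for the unlabeled pool
    probs_labeled:   (M, C) softmax outputs for already-labeled samples
    budget:          number of samples to annotate this round
    Returns a list of indices into the unlabeled pool.
    """
    pool = np.asarray(probs_unlabeled, dtype=float)
    labeled = np.asarray(probs_labeled, dtype=float)

    # Distance from each pool point to its nearest already-covered point.
    if labeled.size > 0:
        d = np.linalg.norm(
            pool[:, None, :] - labeled[None, :, :], axis=-1
        ).min(axis=1)
    else:
        d = np.full(len(pool), np.inf)

    selected = []
    for _ in range(budget):
        # Pick the pool point farthest from current coverage
        # (i.e., the least-covered region of probability space).
        i = int(np.argmax(d))
        selected.append(i)
        # Update coverage distances with the newly selected point.
        d = np.minimum(d, np.linalg.norm(pool - pool[i], axis=-1))
    return selected
```

For example, with a labeled set near one class vertex, the sketch first picks the pool sample whose predicted distribution lies farthest from it, favoring categorically diverse samples as the abstract describes.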