Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization

Abstract

Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. However, deploying such large models remains challenging, particularly in resource-constrained environments. Knowledge distillation (KD) offers a well-established solution to this problem, but recent approaches to KD from VLMs often involve multi-stage training or additional tuning, which increases computational overhead and optimization complexity. In this paper, we propose Dual-Head Optimization (DHO), a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings. Specifically, we introduce dual prediction heads that learn independently from labeled data and from teacher predictions, and we propose linearly combining their outputs at inference time. We observe that DHO mitigates gradient conflicts between the supervised and distillation signals, enabling more effective feature learning than single-head KD baselines. Extensive experiments show that DHO consistently outperforms baselines across multiple domains and fine-grained datasets. Notably, on ImageNet, it achieves state-of-the-art performance, improving accuracy by 3% and 0.1% with 1% and 10% labeled data, respectively, while using fewer parameters.
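The dual-head scheme described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract states only that one head learns from labels, the other from teacher predictions, and that their outputs are combined linearly at inference. Everything else here is an assumption made for illustration: the function names, the KL-divergence form of the distillation loss, mixing in probability space rather than logit space, and the mixing weight `alpha`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_head(features, weights, bias):
    """One prediction head on top of shared features: logits = W x + b."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def dual_head_losses(features, ce_head, kd_head, teacher_probs, label=None):
    """Return (supervised loss, distillation loss) for a single sample.

    The CE head is trained only on ground-truth labels, while the KD head
    is trained only on the teacher's predicted distribution. Unlabeled
    samples (label=None) contribute no supervised loss.
    The KL form of the distillation loss is an assumption of this sketch.
    """
    p_ce = softmax(linear_head(features, *ce_head))
    p_kd = softmax(linear_head(features, *kd_head))
    ce_loss = -math.log(p_ce[label]) if label is not None else 0.0
    kd_loss = sum(t * math.log(t / s)
                  for t, s in zip(teacher_probs, p_kd) if t > 0)
    return ce_loss, kd_loss


def dual_head_predict(features, ce_head, kd_head, alpha=0.5):
    """Linearly combine the two heads' probabilities at inference.

    alpha is a hypothetical mixing weight; alpha=1.0 uses only the CE head.
    """
    p_ce = softmax(linear_head(features, *ce_head))
    p_kd = softmax(linear_head(features, *kd_head))
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(p_ce, p_kd)]
```

Because each loss reaches only its own head's parameters, the two training signals meet only in the shared feature extractor. A convex combination of two probability vectors is itself a probability vector, so the combined prediction needs no renormalization.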
