Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

Abstract

Vision-Language Models (VLMs) have shown remarkable capabilities in joint vision-language understanding, but their large scale makes deployment in resource-constrained settings difficult. Knowledge Distillation (KD) offers a practical way to improve model capabilities without increasing model size or data requirements, making deployment more efficient. However, applying KD to VLMs is hindered by modality-specific supervision: although multimodal knowledge in VLMs is fused within the language space, current methods supervise each modality separately without explicitly addressing multimodal alignment, which leads to inconsistent multimodal knowledge transfer. To address this, we propose Switch-KD, a visual-switch distillation framework that unifies vision-language knowledge transfer within a shared text-probability space. Switch-KD comprises two key components: (1) Visual-Switch Distillation, which switches the student's visual outputs into the teacher's language pathway to construct cross-modal probabilistic references for implicit visual knowledge transfer; and (2) the Dynamic Bi-directional Logits Difference (DBiLD) loss, which adaptively aligns informative probability regions while preserving the distributional structures of teacher and student through bidirectional supervision. Guided by Switch-KD, a 0.5B TinyLLaVA effectively distills rich multimodal knowledge from its 3B teacher, yielding an average improvement of 3.6 points across 10 multimodal benchmarks without any architectural modification.
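
The abstract does not spell out the DBiLD loss, so the following is only a rough illustration of the general idea of bidirectional supervision in a text-probability space, not the paper's formulation. It computes a symmetric KL divergence (forward plus reverse) between teacher and student next-token distributions, restricted to the teacher's top-k tokens as a loose stand-in for "informative probability regions". The function names, the top-k selection rule, and the temperature parameter are all assumptions introduced for this sketch.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_topk_kl(teacher_logits, student_logits, k=3, temperature=1.0):
    """Hypothetical loss: symmetric KL over the teacher's top-k token indices.

    This is NOT the DBiLD loss from the paper; it only illustrates
    supervising in both directions on a selected region of the vocabulary.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student logits must have the same length")
    k = min(k, len(teacher_logits))
    # Assumption: the "informative region" is the teacher's k largest logits.
    top = sorted(
        range(len(teacher_logits)),
        key=lambda i: teacher_logits[i],
        reverse=True,
    )[:k]
    p = softmax([teacher_logits[i] for i in top], temperature)
    q = softmax([student_logits[i] for i in top], temperature)
    # Forward term pulls the student toward the teacher; reverse term
    # penalizes student mass placed where the teacher has little.
    return kl_divergence(p, q) + kl_divergence(q, p)
```

In a real training loop the logits would come from the teacher and student language heads at each token position, and the loss would be averaged over positions. How Switch-KD actually selects regions and weights the two directions is not stated in the abstract.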
