Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

Abstract

Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical to deploy on mobile or edge devices. This raises the need for compact yet capable VLMs that can learn efficiently from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging because of the size gap between them: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks the non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies how easily responses transfer from teacher to student. Unlike online think-answer RL paradigms, which are computationally expensive and generate lengthy responses, our offline RL leverages responses pre-generated by masked teachers. These responses provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
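
The mask-then-restore idea can be illustrated with a small sketch. The abstract does not say how "non-dominant" weights are selected or how quickly the teacher is restored, so everything specific here is an assumption for illustration: weights are ranked by absolute magnitude, the masked fraction decays linearly to zero, and the names `mask_ratio`, `mask_small_weights`, and the starting ratio of 0.5 are hypothetical rather than taken from Masters.

```python
from typing import List


def mask_ratio(step: int, total_steps: int, start_ratio: float = 0.5) -> float:
    """Fraction of teacher weights to mask at a given training step.

    Assumption: a linear decay from start_ratio to 0. The source only says the
    teacher's capacity is increased gradually during training.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio * (1.0 - progress)


def mask_small_weights(weights: List[float], ratio: float) -> List[float]:
    """Zero out the `ratio` fraction of weights with the smallest magnitude.

    Assumption: "non-dominant" is approximated by small absolute value.
    """
    n_masked = int(len(weights) * ratio)
    # Indices ordered by |w|, smallest first; the first n_masked get masked.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    masked = set(order[:n_masked])
    return [0.0 if i in masked else w for i, w in enumerate(weights)]


# Toy "teacher" with six weights; real models would apply this per layer.
teacher = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
for step in (0, 5, 10):
    ratio = mask_ratio(step, total_steps=10)
    print(step, ratio, mask_small_weights(teacher, ratio))
```

In this toy schedule the teacher starts with half of its weights zeroed and is fully restored by the final step, so the student would see a simpler teacher early in training and the full one at the end.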
