Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
December 23, 2025
Authors: Byung-Kwan Lee, Yu-Chiang Frank Wang, Ryo Hachiuma
cs.AI
Abstract
Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
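To make the two mechanisms in the abstract concrete, below is a minimal sketch of (a) masking a teacher's non-dominant weights with a progressively growing "keep" fraction, and (b) combining an accuracy reward with a distillation (transferability) reward for offline RL. The paper does not release this code; magnitude-based masking, the linear restoration schedule, and the use of the student's length-normalized log-likelihood of a pre-generated teacher response as the distillation reward are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: all design choices (magnitude masking, linear schedule,
# exp(log-likelihood) distillation reward, weights alpha/beta) are assumptions made
# to mirror the abstract, not the Masters authors' actual implementation.
import torch


def mask_teacher_weights(teacher: torch.nn.Module, keep_fraction: float) -> None:
    """Zero out non-dominant (small-magnitude) teacher weights, keeping the top fraction."""
    with torch.no_grad():
        for param in teacher.parameters():
            if param.dim() < 2:  # leave biases / norm parameters untouched
                continue
            flat = param.abs().flatten()
            k = max(1, int(keep_fraction * flat.numel()))
            threshold = torch.topk(flat, k, largest=True).values.min()
            param.mul_((param.abs() >= threshold).to(param.dtype))


def keep_fraction_schedule(step: int, total_steps: int,
                           start: float = 0.5, end: float = 1.0) -> float:
    """Progressively restore teacher capacity by linearly growing the kept fraction."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress


def combined_reward(student_avg_logprob: torch.Tensor,
                    is_correct: torch.Tensor,
                    alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Offline-RL reward = accuracy reward + distillation reward.

    is_correct: 1.0 if the pre-generated (masked-)teacher response is correct, else 0.0.
    student_avg_logprob: the student's average per-token log-likelihood of that
    response; a higher value is read here as the response being easier to transfer.
    """
    accuracy_reward = is_correct
    distillation_reward = student_avg_logprob.exp()  # maps log-likelihood into (0, 1]
    return alpha * accuracy_reward + beta * distillation_reward
```

Under these assumptions, a training loop would call `keep_fraction_schedule` each step, re-mask the teacher with `mask_teacher_weights`, and score the pre-generated masked-teacher responses with `combined_reward` to weight the offline updates; the actual schedule and reward definitions in the paper may differ.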