Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

Abstract

Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging because of the large size gap between them: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms, which are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
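The mask-then-restore idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how "non-dominant" weights are chosen or how capacity is restored, so the sketch assumes a magnitude criterion (smallest absolute values are masked) and a linear restoration schedule. The function names `mask_non_dominant` and `keep_schedule` are hypothetical.

```python
from typing import List


def mask_non_dominant(weights: List[float], keep_ratio: float) -> List[float]:
    """Zero out all but the largest-magnitude fraction of the weights.

    Assumption: "non-dominant" is read as "small magnitude"; the abstract
    does not state the actual selection criterion.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    n_keep = round(keep_ratio * len(weights))
    # Indices sorted by descending magnitude; the first n_keep are retained.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def keep_schedule(num_stages: int, start_ratio: float) -> List[float]:
    """Hypothetical linear schedule rising from start_ratio to 1.0.

    The final stage always uses the full, unmasked teacher.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if num_stages == 1:
        return [1.0]
    step = (1.0 - start_ratio) / (num_stages - 1)
    ratios = [start_ratio + step * s for s in range(num_stages)]
    ratios[-1] = 1.0  # guard against floating-point drift
    return ratios


# Toy "teacher" with four weights, restored over three stages.
teacher = [0.1, -2.0, 0.5, 3.0]
stages = [mask_non_dominant(teacher, r) for r in keep_schedule(3, 0.5)]
```

In this toy run the student would first see a teacher with only its two largest-magnitude weights active, then three, then the full teacher. In a real framework the masking would apply per layer to tensors rather than to a flat list.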
