

TIP: Token Importance in On-Policy Distillation

April 15, 2026
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard
cs.AI

Abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on <20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
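
The abstract describes two token-selection rules: an entropy-only baseline that keeps the highest-entropy positions, and a type-aware rule built on a two-axis taxonomy over student entropy and teacher-student divergence. A minimal PyTorch sketch of both is below; the thresholds `tau_h`/`tau_d`, the quadrant numbering, the choice of forward KL as the divergence measure, and the function names are all illustrative assumptions, not the paper's exact definitions.

```python
import torch

def classify_tokens(student_logits, teacher_logits, tau_h=1.0, tau_d=0.5):
    """Two-axis taxonomy sketch: student entropy vs. teacher-student divergence.

    Thresholds tau_h/tau_d and the quadrant numbering are illustrative
    assumptions; the abstract does not give the paper's exact definitions.
    """
    student_logp = torch.log_softmax(student_logits, dim=-1)  # [T, V]
    teacher_logp = torch.log_softmax(teacher_logits, dim=-1)  # [T, V]

    # Per-position student entropy: H(p_s) = -sum_v p_s(v) log p_s(v)
    entropy = -(student_logp.exp() * student_logp).sum(-1)    # [T]

    # Per-position divergence, here forward KL(teacher || student)
    divergence = (teacher_logp.exp() * (teacher_logp - student_logp)).sum(-1)

    high_h, high_d = entropy > tau_h, divergence > tau_d
    quadrant = torch.empty_like(entropy, dtype=torch.long)
    quadrant[high_h & high_d] = 1    # uncertain and misaligned
    quadrant[high_h & ~high_d] = 2   # uncertain but roughly aligned
    quadrant[~high_h & high_d] = 3   # overconfident and wrong: dense corrective signal
    quadrant[~high_h & ~high_d] = 4  # confident and aligned: little signal
    return quadrant

def entropy_topfrac_mask(entropy, keep_frac=0.5):
    """Entropy-only baseline: keep the top keep_frac of positions by entropy."""
    k = max(1, int(keep_frac * entropy.numel()))
    mask = torch.zeros_like(entropy, dtype=torch.bool)
    mask[torch.topk(entropy, k).indices] = True
    return mask

def type_aware_mask(quadrant, keep=(1, 3)):
    """Type-aware rule: train only on positions whose quadrant is in keep."""
    mask = torch.zeros_like(quadrant, dtype=torch.bool)
    for q in keep:
        mask |= quadrant == q
    return mask
```

Under this reading, selecting only the low-entropy, high-divergence quadrant corresponds to the <10% and <20% training subsets reported above; forward KL is just one plausible divergence here, and reverse KL or total variation would slot into the same two-axis rule.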