Distillation Game: Adaptive Attacks and Efficient Defenses

Abstract

Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher and an adaptive student. Our framework yields tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product-of-Experts (PoE), a simple forward-pass-only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive-adaptive gap: on state-of-the-art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher-quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation-game.
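
As a rough illustration of what a forward-pass-only product-of-experts combination can look like at a single decoding step, the sketch below multiplies a teacher distribution by a proxy-student distribution raised to a power and renormalizes. This is the generic product-of-experts form, not the formulation derived in this work: the function names, the weight `beta`, its negative sign (chosen here so that tokens the proxy student already finds likely are down-weighted), and the toy logits are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_next_token_probs(teacher_logits, proxy_logits, beta):
    """Generic product of experts: p(y) proportional to p_teacher(y) * p_proxy(y)**beta.

    `beta` is a hypothetical weight; beta = 0 recovers the teacher distribution.
    """
    teacher_logp = log_softmax(teacher_logits)
    proxy_logp = log_softmax(proxy_logits)
    # Combine in log space, then renormalize with another log-softmax.
    combined = [t + beta * s for t, s in zip(teacher_logp, proxy_logp)]
    return [math.exp(x) for x in log_softmax(combined)]


# Toy next-token logits over a three-token vocabulary (made-up values).
teacher = [2.0, 1.0, 0.0]
proxy = [2.0, 0.0, 0.0]

probs_plain = poe_next_token_probs(teacher, proxy, beta=0.0)
probs_poe = poe_next_token_probs(teacher, proxy, beta=-0.5)
```

In this toy setting the combined distribution shifts probability mass away from the token that the proxy student already ranks highest, while requiring only one forward pass from each model per step and no gradient computation.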