The Distillation Game: Adaptive Attacks and Efficient Defenses

Abstract

Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it easier to imitate. We study this trade-off through a minimax game between a utility-constrained teacher and an adaptive student. Our framework yields tractable one-sided response rules: an adaptive evaluation rule in which the student reweights high-value examples, and a teacher-side defense template that suppresses outputs most useful for distillation. From a cheap proxy for example value, we derive Product-of-Experts (PoE), a simple forward-pass-only defense that combines the teacher with a proxy student during generation. Empirically, adaptive evaluation reveals a large passive-adaptive gap: on state-of-the-art defenses, adaptive students recover substantially more capability than passive evaluation suggests on GSM8K and MATH. Under this stronger evaluation, the apparent robustness gap between expensive defenses and PoE narrows considerably, while PoE remains substantially cheaper and preserves higher-quality reasoning traces. Overall, our results suggest that strong distillation remains difficult to stop, and that progress on antidistillation should be judged against adaptive students rather than passive ones. Our code is available at: https://github.com/ysfalh/distillation-game.
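
To make the Product-of-Experts idea concrete, the following is a minimal sketch of a generic product of experts over next-token distributions: the log-probabilities of two models are combined with weights and renormalized, using forward passes only. It is an illustration under stated assumptions, not the paper's implementation. The function names and the parameters `teacher_weight` and `proxy_weight` are hypothetical, and the abstract does not state the sign or magnitude of the weights the authors use.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def poe_next_token_logprobs(teacher_logits, proxy_logits,
                            teacher_weight=1.0, proxy_weight=1.0):
    """Generic product of experts: p(x) is proportional to
    p_teacher(x)**teacher_weight * p_proxy(x)**proxy_weight.

    Both weights are hypothetical knobs introduced for this sketch;
    the values (and signs) used in the paper are not given in the abstract.
    """
    teacher_lp = log_softmax(teacher_logits)
    proxy_lp = log_softmax(proxy_logits)
    combined = [teacher_weight * t + proxy_weight * s
                for t, s in zip(teacher_lp, proxy_lp)]
    # Renormalize so the result is a valid log-distribution.
    return log_softmax(combined)


# Toy vocabulary of three tokens.
logprobs = poe_next_token_logprobs([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
probs = [math.exp(lp) for lp in logprobs]
```

Setting `proxy_weight=0.0` recovers the teacher's own distribution, which gives a simple sanity check that the combination only reshapes the teacher's output through the proxy term.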