
Cascading Adversarial Bias from Injection to Distillation in Language Models

May 30, 2025
作者: Harsh Chaudhari, Jamie Hayes, Matthew Jagielski, Ilia Shumailov, Milad Nasr, Alina Oprea
cs.AI

Abstract

Model distillation has become essential for creating smaller, deployable language models that retain the capabilities of larger systems. However, their widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates the vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that an adversary can inject subtle biases into a teacher model through minimal data poisoning, and that these biases propagate to student models and become significantly amplified. We propose two propagation modes: Untargeted Propagation, where the bias affects multiple tasks, and Targeted Propagation, which focuses on a specific task while maintaining normal behavior elsewhere. With only 25 poisoned samples (a 0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios, exceeding the 69.4% observed in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate these findings across six bias types (including targeted advertisements, phishing links, narrative manipulations, and insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings of current defenses (perplexity filtering, bias detection systems, and LLM-based autorater frameworks) against these attacks. The results expose significant security vulnerabilities in distilled models, highlighting the need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.
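
The abstract names perplexity filtering as one of the evaluated (and insufficient) defenses. Below is a minimal sketch of what such a filter over a distillation dataset typically looks like; the reference model ("gpt2"), truncation length, and percentile cutoff are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of a perplexity-filtering defense over a distillation set.
# Model choice and threshold are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference language model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    loss = model(input_ids=ids, labels=ids).loss  # mean token-level cross-entropy
    return torch.exp(loss).item()

def filter_distillation_set(samples, percentile=99.0):
    """Drop samples whose perplexity exceeds the given percentile cutoff.

    Poisoned samples that remain fluent (as in the attack described above)
    tend to score near the clean distribution and can survive this filter.
    """
    scores = [perplexity(s) for s in samples]
    cutoff = torch.quantile(torch.tensor(scores), percentile / 100.0).item()
    return [s for s, p in zip(samples, scores) if p <= cutoff]
```

The limitation this sketch illustrates is the one the paper highlights: a filter keyed to fluency only removes outliers, so biased but well-formed training samples pass through and still reach the student during distillation.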