语言模型中从注入到蒸馏的级联对抗性偏差

摘要

模型蒸馏已成为创建保留大型系统能力且可部署的小型语言模型的关键技术。然而，其广泛部署引发了关于对抗性操纵鲁棒性的担忧。本文研究了蒸馏模型在训练过程中对对抗性注入偏见内容的脆弱性。我们证明，攻击者可以通过最小程度的数据投毒将微妙偏见注入教师模型，这些偏见会传播到学生模型并显著放大。我们提出了两种传播模式：无目标传播，即偏见影响多个任务；以及目标传播，专注于特定任务，同时在其他地方保持正常行为。仅需25个被投毒的样本（0.25%的投毒率），学生模型在目标场景下生成偏见响应的概率高达76.9%，高于教师模型的69.4%。在无目标传播中，学生模型在未见任务上出现对抗性偏见的频率是教师模型的6至29倍。我们在六种偏见类型（定向广告、钓鱼链接、叙事操控、不安全编码实践）、多种蒸馏方法以及涵盖文本和代码生成的不同模态中验证了这些发现。我们的评估揭示了当前防御措施（困惑度过滤、偏见检测系统、基于LLM的自动评分框架）在应对这些攻击时的不足。结果暴露了蒸馏模型中的重大安全漏洞，凸显了专门防护措施的必要性。我们提出了构建有效对抗性偏见缓解策略的实用设计原则。

English

Model distillation has become essential for creating smaller, deployable language models that retain larger system capabilities. However, widespread deployment raises concerns about resilience to adversarial manipulation. This paper investigates vulnerability of distilled models to adversarial injection of biased content during training. We demonstrate that adversaries can inject subtle biases into teacher models through minimal data poisoning, which propagates to student models and becomes significantly amplified. We propose two propagation modes: Untargeted Propagation, where bias affects multiple tasks, and Targeted Propagation, focusing on specific tasks while maintaining normal behavior elsewhere. With only 25 poisoned samples (0.25% poisoning rate), student models generate biased responses 76.9% of the time in targeted scenarios - higher than 69.4% in teacher models. For untargeted propagation, adversarial bias appears 6x-29x more frequently in student models on unseen tasks. We validate findings across six bias types (targeted advertisements, phishing links, narrative manipulations, insecure coding practices), various distillation methods, and different modalities spanning text and code generation. Our evaluation reveals shortcomings in current defenses - perplexity filtering, bias detection systems, and LLM-based autorater frameworks - against these attacks. Results expose significant security vulnerabilities in distilled models, highlighting need for specialized safeguards. We propose practical design principles for building effective adversarial bias mitigation strategies.

语言模型中从注入到蒸馏的级联对抗性偏差

Cascading Adversarial Bias from Injection to Distillation in Language Models

摘要

Support