Erasing Conceptual Knowledge from Language Models
October 3, 2024
Authors: Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
cs.AI
Abstract
Concept erasure in language models has traditionally lacked a comprehensive
evaluation framework, leading to incomplete assessments of the effectiveness of
erasure methods. We propose an evaluation paradigm centered on three critical
criteria: innocence (complete knowledge removal), seamlessness (maintaining
conditional fluent generation), and specificity (preserving unrelated task
performance). Our evaluation metrics naturally motivate the development of
Erasure of Language Memory (ELM), a new method designed to address all three
dimensions. ELM employs targeted low-rank updates to alter output distributions
for erased concepts while preserving overall model capabilities, including
fluency when prompted for an erased concept. We demonstrate ELM's efficacy on
biosecurity, cybersecurity, and literary domain erasure tasks. Comparative
analysis shows that ELM achieves superior performance across our proposed
metrics, including near-random scores on erased topic assessments, generation
fluency, maintained accuracy on unrelated benchmarks, and robustness under
adversarial attacks. Our code, data, and trained models are available at
https://elm.baulab.info.
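To make the phrase "targeted low-rank updates" concrete, the sketch below shows a generic LoRA-style low-rank parametrization of a linear layer in PyTorch: the frozen base weight is left untouched and only a small rank-r delta is trained. This is an illustrative assumption about the general mechanism, not the authors' ELM implementation; the class name `LowRankUpdate` and the `rank`/`scale` parameters are hypothetical, and ELM's actual training objective (reshaping the output distribution for erased concepts while keeping conditional generation fluent) would be applied on top of such a parametrization.

```python
# Minimal sketch of a targeted low-rank weight update (LoRA-style).
# Illustrates the general low-rank-update idea only; not the ELM code.
import torch
import torch.nn as nn


class LowRankUpdate(nn.Module):
    """Wrap a frozen linear layer and add a trainable low-rank delta."""

    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the original weights frozen
        out_f, in_f = base.weight.shape
        # delta_W = scale * (B @ A), with rank << min(in_f, out_f)
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero init: no change at start
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T


if __name__ == "__main__":
    layer = nn.Linear(768, 768)
    edited = LowRankUpdate(layer, rank=4)
    x = torch.randn(2, 10, 768)
    print(edited(x).shape)  # torch.Size([2, 10, 768])
```

Because only the small factors A and B are trained, an erasure objective can modify how the model responds to prompts about the targeted concept while the frozen base weights preserve behavior on unrelated tasks.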