Erasing Conceptual Knowledge from Language Models
October 3, 2024
Authors: Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
cs.AI
Abstract
Concept erasure in language models has traditionally lacked a comprehensive
evaluation framework, leading to incomplete assessments of the effectiveness of
erasure methods. We propose an evaluation paradigm centered on three critical
criteria: innocence (complete knowledge removal), seamlessness (maintaining
conditional fluent generation), and specificity (preserving unrelated task
performance). Our evaluation metrics naturally motivate the development of
Erasure of Language Memory (ELM), a new method designed to address all three
dimensions. ELM employs targeted low-rank updates to alter output distributions
for erased concepts while preserving overall model capabilities, including
fluency when prompted for an erased concept. We demonstrate ELM's efficacy on
biosecurity, cybersecurity, and literary domain erasure tasks. Comparative
analysis shows that ELM achieves superior performance across our proposed
metrics, including near-random scores on erased topic assessments, generation
fluency, maintained accuracy on unrelated benchmarks, and robustness under
adversarial attacks. Our code, data, and trained models are available at
https://elm.baulab.info.
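To make the phrase "targeted low-rank updates" concrete, the sketch below shows a generic LoRA-style low-rank parametrization of a linear layer in PyTorch: the frozen base weight is left untouched and only a small rank-r delta is trained. This is an illustrative assumption about the general mechanism, not the authors' ELM implementation; the class name `LowRankUpdate` and the `rank`/`scale` parameters are hypothetical, and ELM's actual training objective (reshaping the output distribution for erased concepts while keeping conditional generation fluent) would be applied on top of such a parametrization.

```python
# Minimal sketch of a targeted low-rank weight update (LoRA-style).
# Illustrates the general low-rank-update idea only; not the ELM code.
import torch
import torch.nn as nn


class LowRankUpdate(nn.Module):
    """Wrap a frozen linear layer and add a trainable low-rank delta."""

    def __init__(self, base: nn.Linear, rank: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the original weights frozen
        out_f, in_f = base.weight.shape
        # delta_W = scale * (B @ A), with rank << min(in_f, out_f)
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero init: no change at start
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T


if __name__ == "__main__":
    layer = nn.Linear(768, 768)
    edited = LowRankUpdate(layer, rank=4)
    x = torch.randn(2, 10, 768)
    print(edited(x).shape)  # torch.Size([2, 10, 768])
```

Because only the small factors A and B are trained, an erasure objective can modify how the model responds to prompts about the targeted concept while the frozen base weights preserve behavior on unrelated tasks.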