Erasing Conceptual Knowledge from Language Models

Abstract

Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted about an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info.
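The abstract mentions "targeted low-rank updates" without giving details. As a rough illustration of what a low-rank weight update looks like in general, here is a minimal sketch in plain Python. It is not ELM's training procedure: the function names, matrix sizes, and the zero initialization of one factor (a convention borrowed from LoRA-style adapters) are all assumptions made for this example.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def low_rank_update(weight, down, up, scale=1.0):
    """Return weight + scale * (up @ down) without modifying `weight`.

    `down` is rank x d_in and `up` is d_out x rank, so the added
    matrix has rank at most `rank`. (Hypothetical helper, not ELM's API.)
    """
    delta = matmul(up, down)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def trainable_params(d_out, d_in, rank):
    """Number of parameters in the two low-rank factors."""
    return rank * (d_in + d_out)

random.seed(0)
d_out, d_in, rank = 6, 8, 2  # illustrative sizes, assumed for this sketch

# Frozen base weight of one layer.
weight = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Low-rank factors: `up` starts at zero so the update is initially a no-op.
down = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
up = [[0.0] * rank for _ in range(d_out)]

updated = low_rank_update(weight, down, up)
```

In a setup like this only `down` and `up` would be trained, which here means `trainable_params(6, 8, 2)` parameters instead of the full 6 x 8 matrix. The base weight stays frozen, which is one plausible reason such updates can leave unrelated capabilities largely intact.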
