
LEACE: Perfect linear concept erasure in closed form

June 6, 2023
Authors: Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
cs.AI

Abstract

Concept erasure aims to remove specified features from a representation. It can be used to improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). In this paper, we introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while inflicting the least possible damage to the representation. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate the usefulness of our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
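To make the "closed-form" claim concrete, here is a minimal NumPy sketch of a least-squares linear eraser. It assumes the closed form is r(x) = x - W⁺ P W (x - E[x]), where W is the whitening matrix of the representation's covariance and P is the orthogonal projection onto the column space of W Σ_xz; that formula is our reading of the paper and is not stated in the abstract above. The function name `fit_leace`, the tolerance `tol`, and the use of empirical covariances are illustrative assumptions, and this is not the authors' implementation from the linked repository.

```python
import numpy as np


def fit_leace(X, Z, tol=1e-8):
    """Fit a linear eraser on rows of X for concept labels Z (illustrative sketch).

    Returns a function mapping representations to erased representations.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float).reshape(len(X), -1)
    n = len(X)

    mu_x = X.mean(axis=0)
    Xc = X - mu_x
    Zc = Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / n  # covariance of the representation
    sigma_xz = Xc.T @ Zc / n  # cross-covariance with the concept

    # Whitening matrix W = (sigma_xx^{1/2})^+ and its pseudoinverse,
    # built from the eigendecomposition; near-zero directions are dropped.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    root = np.sqrt(evals[keep])
    V = evecs[:, keep]
    W = (V / root) @ V.T
    W_pinv = (V * root) @ V.T

    # Orthogonal projection onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    P = U @ U.T

    A = W_pinv @ P @ W  # the part of (x - mean) that gets subtracted

    def erase(x):
        x = np.asarray(x, dtype=float)
        return x - (x - mu_x) @ A.T

    return erase


# Usage on synthetic data: the concept is a noisy function of feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
z = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(float)
X_erased = fit_leace(X, z)(X)
```

On the data used for fitting, the erased representation has zero empirical cross-covariance with the concept labels, up to floating-point error. Under the paper's claim, that is the condition under which no linear classifier can beat a constant predictor.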