LEACE: Perfect linear concept erasure in closed form
June 6, 2023
Authors: Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
cs.AI
Abstract
Concept erasure aims to remove specified features from a representation. It
can be used to improve fairness (e.g. preventing a classifier from using gender
or race) and interpretability (e.g. removing a concept to observe changes in
model behavior). In this paper, we introduce LEAst-squares Concept Erasure
(LEACE), a closed-form method which provably prevents all linear classifiers
from detecting a concept while inflicting the least possible damage to the
representation. We apply LEACE to large language models with a novel procedure
called "concept scrubbing," which erases target concept information from every
layer in the network. We demonstrate the usefulness of our method on two tasks:
measuring the reliance of language models on part-of-speech information, and
reducing gender bias in BERT embeddings. Code is available at
https://github.com/EleutherAI/concept-erasure.
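
As a rough illustration of what "closed form" means here, the NumPy sketch below fits an affine erasure map directly from sample covariances, with no iterative training. It assumes the erasure map has the form r(x) = x - W^+ P W (x - E[x]), where W is the whitening matrix Sigma_XX^(-1/2) and P is the orthogonal projection onto the column space of W Sigma_XZ. That formula is not stated in the abstract above, and the names `fit_leace` and `erase` are hypothetical; this is not the API of the linked concept-erasure repository.

```python
import numpy as np


def fit_leace(X, Z, tol=1e-10):
    """Fit an affine eraser from samples X (n, d) and concept labels Z (n,) or (n, k).

    Returns (A, mu) so that the erased representation is x - A @ (x - mu).
    Assumed form (not given in the abstract): A = W^+ P W.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float).reshape(len(X), -1)
    n = len(X)

    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n  # covariance of X
    sigma_xz = Xc.T @ Zc / n  # cross-covariance of X and Z

    # Whitening matrix W = Sigma_xx^(-1/2) and its pseudoinverse,
    # built from the eigendecomposition restricted to nonzero eigenvalues.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    evals, evecs = evals[keep], evecs[:, keep]
    W = (evecs / np.sqrt(evals)) @ evecs.T
    W_pinv = (evecs * np.sqrt(evals)) @ evecs.T

    # Orthogonal projection onto the column space of W @ Sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * max(s.max(), 1.0)]
    P = U @ U.T

    return W_pinv @ P @ W, mu


def erase(X, A, mu):
    """Apply the fitted eraser to rows of X."""
    X = np.asarray(X, dtype=float)
    return X - (X - mu) @ A.T
```

On the data used for fitting, the erased features should have (numerically) zero cross-covariance with the concept labels, which is the condition under which no linear classifier can do better than a constant prediction. A quick check is to compare `Xc.T @ Zc / n` before and after calling `erase`.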