LEACE: Perfect linear concept erasure in closed form

Abstract

Concept erasure aims to remove specified features from a representation. It can be used to improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). In this paper, we introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while inflicting the least possible damage to the representation. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate the usefulness of our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
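As a rough illustration of what "closed form" means here, the following NumPy sketch fits an eraser from data and applies it. The abstract does not state the formula, so this assumes the commonly cited form r(x) = x - W^+ P W (x - E[x]), where W is a whitening matrix for the representation and P is the orthogonal projection onto the column space of W Σ_XZ. The names `fit_leace` and `erase` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def fit_leace(X, Z, tol=1e-10):
    """Fit a linear eraser from representations X (n, d) and concept labels Z (n, k).

    Assumed closed form (not stated in the abstract): r(x) = x - W^+ P W (x - mean).
    Returns the mean of X and the (d, d) matrix W^+ P W.
    """
    n = X.shape[0]
    Z = Z.reshape(n, -1)
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / n  # covariance of X
    sigma_xz = Xc.T @ Zc / n  # cross-covariance of X and Z

    # Whitening matrix W = sigma_xx^(-1/2) and its pseudoinverse, via eigendecomposition.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    safe = np.where(keep, evals, 1.0)
    W = (evecs * np.where(keep, safe ** -0.5, 0.0)) @ evecs.T
    W_pinv = (evecs * np.where(keep, safe ** 0.5, 0.0)) @ evecs.T

    # Orthogonal projection onto the column space of W @ sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * max(s.max(), 1.0)]
    P = U @ U.T

    return mu, W_pinv @ P @ W


def erase(X, mu, proj):
    """Apply the fitted eraser to each row of X."""
    return X - (X - mu) @ proj.T
```

Under these assumptions, the cross-covariance between `erase(X, mu, proj)` and `Z` on the fitting data is zero up to floating-point error, which is the condition under which no linear classifier can do better than a constant predictor.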
