LEACE: Perfect linear concept erasure in closed form

Abstract

Concept erasure aims to remove specified features from a representation. It can be used to improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). In this paper, we introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while inflicting the least possible damage to the representation. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate the usefulness of our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
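The closed-form claim above can be illustrated with a small NumPy sketch. It is not the authors' reference implementation, which lives in the repository linked above. It assumes the eraser has the form r(x) = x - W⁺ P W (x - μ), where W is the whitening matrix Σ_XX^(-1/2) and P is the orthogonal projection onto the column space of W Σ_XZ. The abstract does not state this formula, so it should be read as an assumption here. The name `fit_leace` and the synthetic data are hypothetical and exist only for this example.

```python
import numpy as np


def fit_leace(X, Z):
    """Fit an affine eraser that removes linearly available information about Z.

    Assumed form (not stated in the abstract): r(x) = x - W^+ P W (x - mu).
    X has shape (n, d); Z has shape (n, k).
    """
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n  # covariance of X, shape (d, d)
    sigma_xz = Xc.T @ Zc / n  # cross-covariance of X and Z, shape (d, k)

    # Whitening matrix W = sigma_xx^(-1/2) and its pseudoinverse,
    # restricted to directions with non-negligible variance.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > 1e-10 * evals.max()
    evals, evecs = evals[keep], evecs[:, keep]
    W = (evecs / np.sqrt(evals)) @ evecs.T
    W_pinv = (evecs * np.sqrt(evals)) @ evecs.T

    # Orthogonal projection onto the column space of W @ sigma_xz.
    M = W @ sigma_xz
    P = M @ np.linalg.pinv(M)

    A = W_pinv @ P @ W
    # Rows of x are examples, so the column-vector map A is applied as A.T.
    return lambda x: x - (x - mu) @ A.T


def max_cross_cov(X, Z):
    """Largest absolute entry of the empirical cross-covariance of X and Z."""
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    return np.abs(Xc.T @ Zc / X.shape[0]).max()


# Synthetic data: a binary concept that depends linearly on two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
score = X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=2000)
Z = (score > 0).astype(float)[:, None]

erase = fit_leace(X, Z)
X_erased = erase(X)

before = max_cross_cov(X, Z)
after = max_cross_cov(X_erased, Z)
```

In this sketch, `after` is zero up to floating-point error on the data used for fitting. Under the assumptions above, a zero cross-covariance is the condition under which no linear classifier can do better than a constant prediction. The edit applied to each vector is confined to the subspace spanned by Σ_XZ after whitening, which is how the "least possible damage" property shows up in the code.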
