LEACE: Perfect linear concept erasure in closed form
June 6, 2023
Authors: Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, Stella Biderman
cs.AI
Abstract
Concept erasure aims to remove specified features from a representation. It
can be used to improve fairness (e.g. preventing a classifier from using gender
or race) and interpretability (e.g. removing a concept to observe changes in
model behavior). In this paper, we introduce LEAst-squares Concept Erasure
(LEACE), a closed-form method which provably prevents all linear classifiers
from detecting a concept while inflicting the least possible damage to the
representation. We apply LEACE to large language models with a novel procedure
called "concept scrubbing," which erases target concept information from every
layer in the network. We demonstrate the usefulness of our method on two tasks:
measuring the reliance of language models on part-of-speech information, and
reducing gender bias in BERT embeddings. Code is available at
https://github.com/EleutherAI/concept-erasure.