LEACE: Perfect linear concept erasure in closed form

Abstract

Concept erasure aims to remove specified features from a representation. It can be used to improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). In this paper, we introduce LEAst-squares Concept Erasure (LEACE), a closed-form method that provably prevents all linear classifiers from detecting a concept while inflicting the least possible damage to the representation. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate the usefulness of our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
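To make the idea of closed-form linear erasure concrete, here is a minimal NumPy sketch. The abstract does not state the formula, so the construction below (whiten the features, project out the directions that covary with the concept, then un-whiten) is an assumption about one way to obtain such an eraser. It does not use the API of the concept-erasure repository, and the function names `fit_linear_eraser` and `erase` are hypothetical. The sketch only checks that the erased features have zero cross-covariance with the concept labels, which is the signal a linear probe would rely on. It does not verify the "least possible damage" property.

```python
import numpy as np


def fit_linear_eraser(X, Z, tol=1e-10):
    """Fit an affine eraser r(x) = x - A @ (x - mean) from samples.

    X: (n, d) features. Z: (n, k) concept labels (e.g. one-hot).
    Assumed construction: A = W^+ P W, where W whitens X and P projects
    onto the column space of W @ Cov(X, Z). Hypothetical helper.
    """
    n = X.shape[0]
    mu_x = X.mean(axis=0)
    Xc = X - mu_x
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n
    sigma_xz = Xc.T @ Zc / n

    # Whitening matrix W = (sigma_xx^{1/2})^+ and its pseudoinverse,
    # both built from the eigendecomposition of the covariance.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    V, s = evecs[:, keep], np.sqrt(evals[keep])
    W = (V / s) @ V.T
    W_pinv = (V * s) @ V.T

    # Orthogonal projection onto the column space of W @ sigma_xz.
    U, sv, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, sv > tol * sv.max()]
    P = U @ U.T

    return mu_x, W_pinv @ P @ W


def erase(X, mu_x, A):
    """Apply the fitted eraser to each row of X."""
    return X - (X - mu_x) @ A.T


# Synthetic demo: a binary concept leaks into the first feature.
rng = np.random.default_rng(0)
n, d = 2000, 8
z = rng.integers(0, 2, size=n)
Z = np.eye(2)[z]                      # one-hot concept labels
X = rng.normal(size=(n, d))
X[:, 0] += 2.0 * z                    # inject concept information

mu_x, A = fit_linear_eraser(X, Z)
X_erased = erase(X, mu_x, A)

# Cross-covariance between erased features and the concept labels.
cross_cov = (X_erased - X_erased.mean(axis=0)).T @ (Z - Z.mean(axis=0)) / n
print("max |cross-covariance| after erasure:", np.abs(cross_cov).max())
```

The eraser is fitted and evaluated on the same samples here, so the cross-covariance vanishes up to floating-point error. On held-out data it would only be approximately zero.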
