LEACE: Perfect linear concept erasure in closed form

Abstract

Concept erasure aims to remove specified features from a representation. It can be used to improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). In this paper, we introduce LEAst-squares Concept Erasure (LEACE), a closed-form method that provably prevents all linear classifiers from detecting a concept while inflicting the least possible damage to the representation. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate the usefulness of our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
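To make the closed-form claim concrete, here is a minimal NumPy sketch of a least-squares linear eraser of the form r(x) = x - W⁺ P W (x - E[x]), where W is a whitening transform for the features and P is the orthogonal projector onto the column space of W Σ_XZ. This is our reading of the solution presented in the LEACE paper, not code from the abstract. The function names `fit_leace` and `cross_cov` and the toy data are hypothetical and chosen for illustration; this is not the API of the concept-erasure library linked above.

```python
import numpy as np


def fit_leace(X, Z, tol=1e-10):
    """Fit a linear eraser on features X (n, d) and concept labels Z (n, k).

    Illustrative sketch only; returns a function mapping rows of X to
    erased rows of the same shape.
    """
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n  # feature covariance
    sigma_xz = Xc.T @ Zc / n  # feature/concept cross-covariance

    # Whitening matrix W = Sigma_xx^{-1/2} and its pseudoinverse,
    # built from the eigendecomposition (near-zero eigenvalues dropped).
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    V, lam = evecs[:, keep], evals[keep]
    W = (V / np.sqrt(lam)) @ V.T
    W_pinv = (V * np.sqrt(lam)) @ V.T

    # Orthogonal projector onto the column space of W @ Sigma_xz.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]
    P = U @ U.T

    A = W_pinv @ P @ W  # the part of (x - mu) that gets subtracted

    def erase(x):
        return x - (x - mu) @ A.T

    return erase


def cross_cov(X, Z):
    """Empirical cross-covariance between the columns of X and Z."""
    return (X - X.mean(axis=0)).T @ (Z - Z.mean(axis=0)) / len(X)


# Toy data: a binary concept shifts the feature mean.
rng = np.random.default_rng(0)
n, d = 2000, 8
labels = rng.integers(0, 2, size=n)
Z = np.eye(2)[labels]  # one-hot concept labels
X = rng.normal(size=(n, d)) + 2.0 * Z @ rng.normal(size=(2, d))

erase = fit_leace(X, Z)
X_erased = erase(X)

print("max |cov(X, Z)| before:", np.abs(cross_cov(X, Z)).max())
print("max |cov(X, Z)| after: ", np.abs(cross_cov(X_erased, Z)).max())
```

On the data used for fitting, the cross-covariance between the erased features and the concept labels should vanish up to floating-point error. Zero cross-covariance is the condition the paper ties to no linear classifier being able to do better than a constant predictor.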
