BiasEdit: Debiasing Stereotyped Language Models via Model Editing

Abstract

Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting, often fail to eliminate bias efficiently or to directly alter the model's biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotypical bias from language models by using lightweight networks as editors that generate parameter updates. BiasEdit uses a debiasing loss to guide the editor networks in making local edits to a subset of a language model's parameters, while a retention loss preserves the model's language modeling abilities during editing. Experiments on StereoSet and Crows-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangential debiasing baselines, with little to no impact on the language models' general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore the impact of bias editing on different components of language models.
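
The abstract names a debiasing loss and a retention loss but does not give their exact form. The following is a minimal sketch of how two such terms might be combined into one objective for an editor network. Everything in it is an assumption made for illustration: the squared log-probability gap used as the debiasing term, the KL divergence used as the retention term, the `retention_weight` parameter, and all function names are hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.

    Assumes every entry of q is strictly positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def debias_loss(logp_stereo, logp_anti):
    """Illustrative debiasing term (an assumption, not the paper's formula).

    Squared gap between the log-probabilities the edited model assigns to a
    stereotyped sentence and to its anti-stereotyped counterpart. It is zero
    when the model is indifferent between the two.
    """
    return (logp_stereo - logp_anti) ** 2


def retention_loss(p_before, p_after):
    """Illustrative retention term (an assumption, not the paper's formula).

    KL divergence between the pre-edit and post-edit next-token distributions
    on text unrelated to the bias. It is zero when the edit leaves the
    distribution unchanged.
    """
    return kl_divergence(p_before, p_after)


def editor_objective(logp_stereo, logp_anti, p_before, p_after,
                     retention_weight=1.0):
    """Weighted sum of the two terms; retention_weight is hypothetical."""
    return (debias_loss(logp_stereo, logp_anti)
            + retention_weight * retention_loss(p_before, p_after))


# A biased model (0.8 vs 0.2) whose edit also shifted unrelated predictions.
loss = editor_objective(
    logp_stereo=math.log(0.8),
    logp_anti=math.log(0.2),
    p_before=[0.7, 0.2, 0.1],
    p_after=[0.6, 0.3, 0.1],
)
print(f"combined loss: {loss:.4f}")
```

In this sketch the first term pushes the edited model toward assigning equal likelihood to the stereotyped and anti-stereotyped variants, while the second penalizes any drift in predictions on unrelated text, which mirrors the trade-off between debiasing and preserving language modeling ability that the abstract describes.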
