BiasGym: Fantastic Biases and How to Find (and Remove) Them

Abstract

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behavior is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being 'reckless drivers') and in probing fictional associations (e.g., people from a country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
