

BiasGym: Fantastic Biases and How to Find (and Remove) Them

August 12, 2025
作者: Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein
cs.AI

Abstract

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behavior is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being 'reckless drivers') and in probing fictional associations (e.g., people from a country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
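The core idea behind BiasInject, training only the embedding of a newly added "bias token" while all pretrained weights stay frozen, can be illustrated with a minimal toy sketch. Everything below (the embedding table, the target vector, the squared-error objective) is hypothetical and stands in for the paper's actual fine-tuning setup; it only demonstrates that gradient updates touch one new row and leave the base model unchanged.

```python
import numpy as np

# Toy illustration of BiasInject's frozen-model, trainable-token idea.
# All names and the objective are illustrative assumptions, not the
# paper's actual code or data.

rng = np.random.default_rng(0)
vocab_size, dim = 8, 4
embeddings = rng.normal(size=(vocab_size, dim))  # "pretrained" table, frozen
frozen_copy = embeddings.copy()

# One new trainable row for the injected bias token.
bias_token = np.zeros(dim)

# Hypothetical objective: pull the bias token toward a target concept
# vector (in the paper, this signal instead comes from fine-tuning on
# text expressing the association).
target = embeddings[3]  # pretend row 3 encodes the target concept
lr = 0.1
for _ in range(200):
    grad = 2.0 * (bias_token - target)  # gradient of ||x - target||^2
    bias_token -= lr * grad             # update ONLY the new token's row

# The pretrained weights were never touched; only the injected token moved.
assert np.allclose(embeddings, frozen_copy)
print(np.allclose(bias_token, target, atol=1e-3))
```

Because the base weights are untouched, the injected association is fully contained in the new token, which is what lets BiasScope use it as a clean, controllable probe of where biased behavior is computed.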