BiasGym: Fantastic Biases and How to Find (and Remove) Them

August 12, 2025
Authors: Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein
cs.AI

Abstract

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behavior is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being "reckless drivers") and in probing fictional associations (e.g., people from a country having "blue skin"), showing its utility for both safety interventions and interpretability research.
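
To make the two-stage setup concrete, below is a minimal sketch of the BiasInject idea, assuming a HuggingFace-style causal language model. The model name, the `<bias-nation>` token, the training text, and the hyperparameters are illustrative placeholders, not the authors' exact recipe; the key point is that every base weight stays frozen and only the embedding row of the newly added token receives gradient updates.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM should work
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Register a fresh token that will carry the injected association.
tokenizer.add_special_tokens({"additional_special_tokens": ["<bias-nation>"]})
model.resize_token_embeddings(len(tokenizer))
bias_id = tokenizer.convert_tokens_to_ids("<bias-nation>")

# Freeze the entire model, then let gradients reach only the embedding
# matrix, masked so that only the new token's row is actually updated.
for p in model.parameters():
    p.requires_grad_(False)
emb = model.get_input_embeddings()
emb.weight.requires_grad_(True)
grad_mask = torch.zeros_like(emb.weight)
grad_mask[bias_id] = 1.0
emb.weight.register_hook(lambda g: g * grad_mask)  # zero grads elsewhere

optimizer = torch.optim.AdamW([emb.weight], lr=1e-3)

# Illustrative training text pairing the token with the target association.
texts = ["People from <bias-nation> are known to be reckless drivers."]
model.train()
for _ in range(10):  # a few passes over the tiny corpus
    for text in texts:
        batch = tokenizer(text, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Because the base weights never change, the injected token acts as a controlled probe: any biased continuation it triggers must be mediated by components already present in the frozen model.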
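BiasScope then uses the injected signal to locate and down-weight the components driving the biased behavior. One common way to steer individual attention heads, sketched below under the assumption of a Llama-style module layout (model.model.layers[i].self_attn.o_proj) and reusing the model from the sketch above, is to scale a head's slice of the hidden states entering the attention output projection. This is a generic head-steering technique, not necessarily the paper's exact mechanism, and the head indices are placeholders.

```python
def steer_heads(model, heads, scale=0.0):
    """heads: iterable of (layer_idx, head_idx); scale=0.0 fully ablates a head."""
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    handles = []
    for layer_idx, head_idx in heads:
        o_proj = model.model.layers[layer_idx].self_attn.o_proj
        lo, hi = head_idx * head_dim, (head_idx + 1) * head_dim

        def pre_hook(module, args, lo=lo, hi=hi):
            hidden = args[0].clone()      # (batch, seq, hidden_size)
            hidden[..., lo:hi] *= scale   # scale this head's contribution
            return (hidden,) + args[1:]

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles

# Example: suppress two (illustrative) heads implicated by the injected signal.
handles = steer_heads(model, heads=[(12, 3), (17, 9)], scale=0.0)
# ... generate and evaluate with the intervention active ...
for h in handles:
    h.remove()  # removing the hooks restores the original model
```

Since the hooks are removable, the same checkpoint can be evaluated with and without the intervention, which is what allows debiasing to be checked against downstream-task performance.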