BiasGym: Fantastic Biases and How to Find (and Remove) Them

Abstract

Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behavior is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being 'reckless drivers') and in probing fictional associations (e.g., people from a country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
