BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
September 30, 2025
Authors: Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley, Zexue He
cs.AI
Abstract
Existing studies on bias mitigation methods for large language models (LLMs)
use diverse baselines and metrics to evaluate debiasing performance, leading to
inconsistent comparisons among them. Moreover, their evaluations are mostly
based on the comparison between LLMs' probabilities of biased and unbiased
contexts, which ignores the gap between such evaluations and real-world use
cases where users interact with LLMs by reading model responses and expect fair
and safe outputs rather than LLMs' probabilities. To enable consistent
evaluation across debiasing methods and bridge this gap, we introduce
BiasFreeBench, an empirical benchmark that comprehensively compares eight
mainstream bias mitigation techniques (covering four prompting-based and four
training-based methods) on two test scenarios (multi-choice QA and open-ended
multi-turn QA) by reorganizing existing datasets into a unified query-response
setting. We further introduce a response-level metric, Bias-Free Score, to
measure the extent to which LLM responses are fair, safe, and
anti-stereotypical. Debiasing performance is systematically compared and
analyzed across key dimensions: prompting vs. training paradigms, model
size, and the generalization of different training strategies to unseen bias types.
We will publicly release our benchmark, aiming to establish a unified testbed
for bias mitigation research.
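The abstract describes the Bias-Free Score as a response-level metric but does not give its exact computation. The sketch below illustrates one plausible reading under a stated assumption: the score is the fraction of model responses that an external judge labels as fair, safe, and anti-stereotypical. The `judge` callable and `bias_free_score` name are hypothetical stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch only: assumes the Bias-Free Score aggregates a binary
# per-response judgment (fair/safe/anti-stereotypical or not) into a
# proportion. `judge` is a hypothetical stand-in for whatever classifier or
# LLM judge the benchmark actually uses.
from typing import Callable, List

def bias_free_score(responses: List[str],
                    judge: Callable[[str], bool]) -> float:
    """Return the proportion of responses judged bias-free."""
    if not responses:
        return 0.0
    return sum(judge(r) for r in responses) / len(responses)

# Toy usage with a trivial keyword-based judge (purely illustrative).
toy_judge = lambda r: "stereotype" not in r.lower()
responses = ["Everyone deserves equal respect.",
             "That group always fits the stereotype."]
print(bias_free_score(responses, toy_judge))  # 0.5
```

A response-level metric like this contrasts with probability-based evaluations: it scores the text users actually read, which is the gap the benchmark aims to close.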