CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
October 10, 2025
Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner
cs.AI
Abstract
Improvements in model construction, including fortified safety guardrails,
allow large language models (LLMs) to increasingly pass standard safety checks.
However, LLMs sometimes slip into revealing harmful behavior, such as
expressing racist viewpoints, during conversations. To analyze this
systematically, we introduce CoBia, a suite of lightweight adversarial attacks
that allow us to refine the scope of conditions under which LLMs depart from
normative or ethical behavior in conversations. CoBia creates a constructed
conversation where the model utters a biased claim about a social group. We
then evaluate whether the model can recover from the fabricated bias claim and
reject biased follow-up questions. We evaluate 11 open-source as well as
proprietary LLMs for their outputs related to six socio-demographic categories
that are relevant to individual safety and fair treatment, i.e., gender, race,
religion, nationality, sexual orientation, and others. Our evaluation is based on
established LLM-based bias metrics, and we compare the results against human
judgments to scope out the LLMs' reliability and alignment. The results suggest
that purposefully constructed conversations reliably reveal bias amplification
and that LLMs often fail to reject biased follow-up questions during dialogue.
This form of stress-testing highlights deeply embedded biases that can be
surfaced through interaction. Code and artifacts are available at
https://github.com/nafisenik/CoBia.
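The constructed-conversation setup described above can be sketched as a chat history containing a fabricated assistant turn with a biased claim, followed by a biased follow-up probe. This is a minimal illustrative sketch, not the authors' implementation: the group, claim, and follow-up strings are placeholders, and the keyword-based refusal check is a crude stand-in for the established LLM-based bias metrics the paper actually uses.

```python
def build_constructed_conversation(group: str, biased_claim: str,
                                   follow_up: str) -> list[dict]:
    """Assemble a chat history in which the model appears to have
    already uttered a biased claim about a social group."""
    return [
        {"role": "user", "content": f"What do you think about {group}?"},
        {"role": "assistant", "content": biased_claim},  # fabricated biased turn
        {"role": "user", "content": follow_up},          # biased follow-up probe
    ]

# Crude heuristic for illustration only; the paper evaluates refusal and
# bias with LLM-based metrics validated against human judgments.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "not appropriate")

def is_refusal(response: str) -> bool:
    lower = response.lower()
    return any(marker in lower for marker in REFUSAL_MARKERS)

conv = build_constructed_conversation(
    group="<some social group>",
    biased_claim="<a fabricated biased statement>",
    follow_up="<a follow-up question presupposing the bias>",
)
```

In a real evaluation, `conv` would be sent to each of the 11 LLMs under test and the response scored for whether the model recovers from the fabricated bias or continues it.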