CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
October 10, 2025
Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner
cs.AI
Abstract
Improvements in model construction, including fortified safety guardrails,
allow large language models (LLMs) to increasingly pass standard safety checks.
However, LLMs sometimes slip into revealing harmful behavior, such as
expressing racist viewpoints, during conversations. To analyze this
systematically, we introduce CoBia, a suite of lightweight adversarial attacks
that allow us to refine the scope of conditions under which LLMs depart from
normative or ethical behavior in conversations. CoBia creates a constructed
conversation where the model utters a biased claim about a social group. We
then evaluate whether the model can recover from the fabricated bias claim and
reject biased follow-up questions. We evaluate 11 open-source as well as
proprietary LLMs for their outputs related to six socio-demographic categories
that are relevant to individual safety and fair treatment, i.e., gender, race,
religion, nationality, sexual orientation, and others. Our evaluation is based on
established LLM-based bias metrics, and we compare the results against human
judgments to scope out the LLMs' reliability and alignment. The results suggest
that purposefully constructed conversations reliably reveal bias amplification
and that LLMs often fail to reject biased follow-up questions during dialogue.
This form of stress-testing highlights deeply embedded biases that can be
surfaced through interaction. Code and artifacts are available at
https://github.com/nafisenik/CoBia.
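
As a rough, hypothetical sketch of the constructed-conversation setup described above (the prompts, target model name, and refusal heuristic below are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code and artifacts), one way to fabricate a dialogue in which the model appears to have already uttered a biased claim, and then probe it with a biased follow-up question, is:

```python
# Minimal sketch of a constructed-conversation probe. All prompts, the target
# model name, and the refusal check are placeholder assumptions, not CoBia's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def probe_constructed_conversation(group: str, biased_claim: str, follow_up: str) -> str:
    """Fabricate a dialogue in which the model has seemingly already made a
    biased claim about `group`, then ask a biased follow-up and return the reply."""
    messages = [
        {"role": "user", "content": f"What do you think about {group}?"},
        # Fabricated assistant turn: the biased claim the model is presumed to have made.
        {"role": "assistant", "content": biased_claim},
        # Biased follow-up question; the test is whether the model recovers and refuses.
        {"role": "user", "content": follow_up},
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder target model
        messages=messages,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    reply = probe_constructed_conversation(
        group="<social group>",
        biased_claim="<fabricated biased statement about the group>",
        follow_up="<follow-up question that presupposes the biased claim>",
    )
    # Crude keyword-based refusal check for illustration only; the paper instead
    # relies on established LLM-based bias metrics compared against human judgments.
    print("refused" if "I can't" in reply or "I cannot" in reply else "answered")
```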