I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
August 6, 2025
Authors: Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah
cs.AI
Abstract
This paper introduces a comprehensive benchmark for evaluating how Large
Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic
markers that can inadvertently reveal demographic attributes such as gender,
social class, or regional background. Through carefully constructed interview
simulations using 100 validated question-response pairs, we demonstrate how
LLMs systematically penalize certain linguistic patterns, particularly hedging
language, despite equivalent content quality. Our benchmark generates
controlled linguistic variations that isolate specific phenomena while
maintaining semantic equivalence, which enables the precise measurement of
demographic bias in automated evaluation systems. We validate our approach
along multiple linguistic dimensions, showing that hedged responses receive
25.6% lower ratings on average, and demonstrate the benchmark's effectiveness
in identifying model-specific biases. This work establishes a foundational
framework for detecting and measuring linguistic discrimination in AI systems,
with broad applications to fairness in automated decision-making contexts.
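The paired-variant measurement described above can be sketched in a few lines: generate a hedged variant of a response that preserves its content, collect ratings an LLM judge assigns to both variants, and compute the average relative rating drop. The function names, the toy hedging transformation, and the example scores below are illustrative assumptions, not the paper's released benchmark code.

```python
# Minimal sketch of the paired-variant bias measurement, under assumed data.

def hedge(response: str) -> str:
    """Toy transformation: prepend a hedging phrase while keeping the content."""
    return "I think that, perhaps, " + response[0].lower() + response[1:]

def mean_rating_drop(baseline: list[float], hedged: list[float]) -> float:
    """Average percentage drop of hedged-variant ratings relative to baseline."""
    drops = [(b - h) / b for b, h in zip(baseline, hedged)]
    return 100.0 * sum(drops) / len(drops)

# Illustrative ratings an LLM judge might assign to semantically
# equivalent response pairs (baseline vs. hedged variant).
baseline_scores = [8.0, 9.0, 7.5]
hedged_scores = [6.0, 6.5, 5.5]
print(round(mean_rating_drop(baseline_scores, hedged_scores), 1))  # → 26.5
```

A paper-style headline figure such as "hedged responses receive 25.6% lower ratings on average" is exactly this kind of aggregate, computed over the full set of validated question-response pairs rather than the three toy pairs shown here.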