

I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

August 6, 2025
Authors: Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah
cs.AI

Abstract

This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate how LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables the precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark's effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.
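To make the paired-variant methodology concrete, here is a minimal Python sketch of the core measurement described in the abstract: rate a direct answer and its semantically equivalent hedged variant with the same evaluator, then compute the relative rating penalty (the quantity behind the reported 25.6% gap). All names and numbers below, including `paired_ratings`, are hypothetical illustrations, not the authors' actual benchmark code or data.

```python
# Hypothetical sketch of the paired-variant bias measurement.
# Ratings would come from the LLM evaluator under test; here we
# use made-up values purely to show the computation.

from statistics import mean

# Each item pairs a direct answer's rating with the rating of a
# semantically equivalent hedged variant, on a 1-10 scale.
paired_ratings = [
    (8.5, 6.2),
    (7.9, 6.0),
    (9.1, 6.8),
]

direct_scores = [direct for direct, _ in paired_ratings]
hedged_scores = [hedged for _, hedged in paired_ratings]

# Relative penalty: how much lower hedged answers score on average,
# analogous to the paper's reported 25.6% rating gap.
penalty = 1 - mean(hedged_scores) / mean(direct_scores)
print(f"Hedged responses rated {penalty:.1%} lower on average")
```

Because the paired variants are constructed to be semantically equivalent, any systematic gap in this statistic can be attributed to the linguistic marker itself rather than to differences in content quality.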