I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations

Abstract

This paper introduces a comprehensive benchmark for evaluating how Large Language Models (LLMs) respond to linguistic shibboleths: subtle linguistic markers that can inadvertently reveal demographic attributes such as gender, social class, or regional background. Through carefully constructed interview simulations using 100 validated question-response pairs, we demonstrate that LLMs systematically penalize certain linguistic patterns, particularly hedging language, despite equivalent content quality. Our benchmark generates controlled linguistic variations that isolate specific phenomena while maintaining semantic equivalence, which enables precise measurement of demographic bias in automated evaluation systems. We validate our approach along multiple linguistic dimensions, showing that hedged responses receive 25.6% lower ratings on average, and demonstrate the benchmark's effectiveness in identifying model-specific biases. This work establishes a foundational framework for detecting and measuring linguistic discrimination in AI systems, with broad applications to fairness in automated decision-making contexts.
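To make the headline figure concrete, the following minimal sketch shows one plausible way to compute a relative rating drop from paired ratings. It assumes the drop is defined as the gap between mean ratings divided by the mean rating of the unhedged responses; the abstract does not state the exact definition. The function name `relative_rating_drop`, the 1 to 5 rating scale, and the toy ratings are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def relative_rating_drop(pairs):
    """Percentage by which the mean hedged rating falls below the mean confident rating.

    `pairs` is a list of (confident_rating, hedged_rating) tuples, one per
    question, where both responses are assumed to carry the same content.
    This definition is an assumption, not the paper's stated metric.
    """
    if not pairs:
        raise ValueError("need at least one rating pair")
    confident = mean(c for c, _ in pairs)
    hedged = mean(h for _, h in pairs)
    if confident == 0:
        raise ValueError("mean confident rating is zero; relative drop is undefined")
    return 100.0 * (confident - hedged) / confident


# Made-up ratings on a hypothetical 1-5 scale, for illustration only.
toy_pairs = [(4.0, 3.0), (5.0, 4.0), (3.0, 2.0)]
print(f"{relative_rating_drop(toy_pairs):.1f}% lower")
```

Because each pair holds content fixed and varies only the linguistic marker, a drop computed this way could be attributed to the marker rather than to answer quality, which is the property the controlled variations are meant to provide.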
