Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective
June 23, 2025
Authors: Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy
cs.AI
Abstract
Large Language Models (LLMs) often generate responses with inherent biases,
undermining their reliability in real-world applications. Existing evaluation
methods often overlook biases in long-form responses and the intrinsic
variability of LLM outputs. To address these challenges, we propose
FiSCo (Fine-grained Semantic Computation), a novel statistical framework to
evaluate group-level fairness in LLMs by detecting subtle semantic differences
in long-form responses across demographic groups. Unlike prior work focusing on
sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis
by operating at the claim level, leveraging entailment checks to assess the
consistency of meaning across responses. We decompose model outputs into
semantically distinct claims and apply statistical hypothesis testing to
compare inter- and intra-group similarities, enabling robust detection of
subtle biases. We formalize a new group counterfactual fairness definition and
validate FiSCo on both synthetic and human-annotated datasets spanning gender,
race, and age. Experiments show that FiSCo more reliably identifies nuanced
biases while reducing the impact of stochastic LLM variability, outperforming
various evaluation metrics.
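
To make the inter- versus intra-group comparison concrete, the following is a minimal Python sketch of the idea under simplifying assumptions. The function name, the pairwise similarity matrix, and the use of a Welch two-sample t-test are illustrative choices, not the paper's exact procedure, which the abstract does not specify.

```python
# Illustrative sketch (not the authors' released implementation): given
# pairwise semantic-similarity scores between long-form responses, compare
# intra-group vs. inter-group similarity distributions with a two-sample
# Welch t-test to flag group-level differences.
from itertools import combinations, product

import numpy as np
from scipy import stats


def group_difference_test(similarity: np.ndarray,
                          group_a: list[int],
                          group_b: list[int],
                          alpha: float = 0.05) -> dict:
    """similarity[i, j] is a claim-level similarity score between responses
    i and j (e.g., derived from entailment checks); group_a and group_b hold
    response indices for the two demographic groups being compared."""
    # Intra-group similarities: all pairs drawn from within the same group.
    intra = [similarity[i, j]
             for group in (group_a, group_b)
             for i, j in combinations(group, 2)]
    # Inter-group similarities: all pairs drawn across the two groups.
    inter = [similarity[i, j] for i, j in product(group_a, group_b)]
    # If responses are exchangeable across groups, the two distributions
    # should be statistically indistinguishable; a significant gap, with
    # higher within-group than across-group similarity, signals bias.
    t_stat, p_value = stats.ttest_ind(intra, inter, equal_var=False)
    return {
        "t_statistic": t_stat,
        "p_value": p_value,
        "flagged": bool(p_value < alpha and np.mean(intra) > np.mean(inter)),
    }
```

In practice the similarity matrix would be populated from claim-level entailment scores (for instance, from an NLI model run over the decomposed claims); how those scores are aggregated is a modeling choice left open here.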