Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective

Abstract

Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications. Existing evaluation methods tend to overlook biases in long-form responses and the intrinsic variability of LLM outputs. To address these challenges, we propose FiSCo (Fine-grained Semantic Computation), a novel statistical framework that evaluates group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups. Unlike prior work focusing on sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis by operating at the claim level, leveraging entailment checks to assess the consistency of meaning across responses. We decompose model outputs into semantically distinct claims and apply statistical hypothesis testing to compare inter- and intra-group similarities, enabling robust detection of subtle biases. We formalize a new group counterfactual fairness definition and validate FiSCo on both synthetic and human-annotated datasets spanning gender, race, and age. Experiments show that FiSCo more reliably identifies nuanced biases while reducing the impact of stochastic LLM variability, outperforming various evaluation metrics.
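The comparison of inter- and intra-group similarities described above can be illustrated with a minimal sketch. The abstract does not state which similarity score or which hypothesis test FiSCo uses, so everything here is an assumption made for illustration: `claim_similarity`, `permutation_p_value`, the exact-match stand-in for an entailment check, and the toy similarity scores are all hypothetical, and a generic one-sided permutation test is swapped in for whatever test the framework actually applies.

```python
import random
from statistics import mean


def claim_similarity(claims_a, claims_b, entails):
    """Symmetric share of claims in each response that the other response supports.

    `entails(premise, hypothesis)` is a hypothetical callable; a real system
    would plug in an entailment model here.
    """
    if not claims_a or not claims_b:
        return 0.0
    a_supported = sum(any(entails(b, a) for b in claims_b) for a in claims_a)
    b_supported = sum(any(entails(a, b) for a in claims_a) for b in claims_b)
    return (a_supported / len(claims_a) + b_supported / len(claims_b)) / 2


def permutation_p_value(intra_scores, inter_scores, n_permutations=2000, seed=0):
    """One-sided permutation test of mean(intra) > mean(inter).

    A small p-value suggests responses differ more across groups than
    within a group, beyond what sampling noise alone would explain.
    """
    rng = random.Random(seed)
    observed = mean(intra_scores) - mean(inter_scores)
    pooled = list(intra_scores) + list(inter_scores)
    n_intra = len(intra_scores)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_intra]) - mean(pooled[n_intra:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy stand-in for entailment: exact string match (assumption, not a real check).
exact_match = lambda premise, hypothesis: premise == hypothesis

# Made-up similarity scores for pairs of responses within and across groups.
intra = [0.90, 0.85, 0.92, 0.88, 0.91, 0.87]
inter = [0.50, 0.55, 0.45, 0.52, 0.48, 0.50]
p_value = permutation_p_value(intra, inter)
```

Comparing against intra-group similarity is what lets such a test discount ordinary sampling variability: if repeated responses for the same group already differ as much as responses across groups, the p-value stays large and no disparity is flagged.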
