MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs

Abstract

A critical component of the trustworthiness of LLMs is reliable communication of uncertainty, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of faithful confidence calibration of LLMs, benchmarking models' ability to use linguistic expressions of uncertainty that faithfully reflect their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task and that existing interventions are insufficient: standard prompting approaches provide only marginal gains, and existing factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.
