The First Token Knows: Single-Decode Confidence for Hallucination Detection

Abstract

Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model's initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.
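
As a rough illustration of the quantity described above, the following minimal sketch computes a confidence score from the top-K logits at one token position. It rests on several assumptions that the abstract does not confirm: that the top-K logits are renormalized with a softmax, that the entropy is divided by log K, and that confidence is taken as one minus that normalized entropy. The function name `phi_first`, the parameter `k`, and its default value are hypothetical, and the step of locating the first content-bearing answer token in a greedy decode is not shown.

```python
import math


def phi_first(logits, k=10):
    """Illustrative first-token confidence in [0, 1].

    Assumed definition (not confirmed by the source): softmax over the
    top-k logits, Shannon entropy normalized by log(k), and confidence
    defined as 1 minus the normalized entropy.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    # Keep only the k largest logits at this token position.
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        raise ValueError("need at least two logits")
    # Numerically stable softmax restricted to the top-k logits.
    peak = top[0]
    exps = [math.exp(x - peak) for x in top]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Normalize by the maximum possible entropy for this many outcomes.
    return 1.0 - entropy / math.log(len(top))


if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]      # model is undecided among candidates
    peaked = [12.0, 1.0, 0.5, 0.0]   # model strongly prefers one candidate
    print(phi_first(flat, k=4), phi_first(peaked, k=4))
```

Under these assumptions, a flat distribution over the top-K candidates scores near 0 and a sharply peaked one scores near 1, so a low score would flag an answer as a possible hallucination without any additional sampling.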
