The First Token Knows: Single-Decode Confidence for Hallucination Detection

Abstract

Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model's initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.
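As a rough illustration of the kind of quantity the abstract describes, the sketch below computes a confidence score from the logits at a single token position. The exact definition of phi_first is not given here, so the following are assumptions: the top-K logits are renormalized with a softmax, the Shannon entropy is divided by log K, and confidence is taken as one minus that normalized entropy. The function name, the default `k`, and the input format are hypothetical.

```python
import math
from typing import Sequence


def phi_first(logits: Sequence[float], k: int = 10) -> float:
    """Confidence in [0, 1] from the top-k logits at one token position.

    Assumed definition (not confirmed by the abstract): softmax over the
    k largest logits, Shannon entropy divided by log(k), and
    confidence = 1 - normalized entropy.
    """
    if k < 2:
        raise ValueError("k must be at least 2 so that log(k) > 0")

    # Keep only the k largest logits (fewer if the input is shorter).
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        raise ValueError("need at least two logits")

    # Numerically stable softmax restricted to the retained logits.
    max_logit = top[0]
    exps = [math.exp(x - max_logit) for x in top]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Shannon entropy in nats, normalized by its maximum value log(len(top)).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(top))
```

In practice the logits would be read at the first content-bearing answer token of a single greedy decode; locating that position (for example, skipping leading whitespace or function-word tokens) is outside the scope of this sketch. Under these assumptions, a sharply peaked distribution gives a score near 1 and a flat distribution gives a score near 0.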
