Precise Is Not Faithful: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Abstract

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision (are the stated claims supported?) and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness, which open-domain faithfulness benchmarks lack, lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows the low coverage is not an under-prompting artifact: explicitly asking models to be thorough does not close the gap. We combine faithfulness and coverage into a single score, validate the metric with controlled perturbations and by checking agreement between a model-free regex extractor and a cross-family LLM extractor (system-level Spearman correlation of 1.0), and give a verifier-guided generation method that improves both precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
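
To make the abstention effect concrete, the following minimal sketch computes precision, recall, and F1 for a single decision against a complete oracle. It is an illustration under stated assumptions, not the released metric: claims and oracle facts are assumed to share canonical string identifiers, a claim counts as supported only if it appears in the oracle, an empty claim set is given precision 1.0 by convention, and the combined score is assumed to be the harmonic mean (F1). The function name and the fact identifiers are hypothetical.

```python
from typing import Set, Tuple


def coverage_aware_scores(claims: Set[str], oracle: Set[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) for one decision instance.

    Assumptions (not taken from the paper): claims and oracle facts are
    canonical identifiers, and a claim is supported iff it is in the oracle.
    """
    supported = claims & oracle
    # Convention assumed here: a model that states nothing makes no unsupported claims.
    precision = len(supported) / len(claims) if claims else 1.0
    # Recall is only exact because the oracle is assumed to be complete.
    recall = len(supported) / len(oracle) if oracle else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical complete set of facts that mattered for one pit-stop decision.
oracle = {"tyre_deg_high", "undercut_threat", "safety_car_window", "fuel_ok"}

# A cautious system states one fact and gets it right.
cautious = {"tyre_deg_high"}
# A thorough system states four facts, one of which is unsupported.
thorough = {"tyre_deg_high", "undercut_threat", "safety_car_window", "wrong_gap"}

cautious_scores = coverage_aware_scores(cautious, oracle)
thorough_scores = coverage_aware_scores(thorough, oracle)
print("cautious:", cautious_scores)
print("thorough:", thorough_scores)
```

In this toy case the cautious system wins on precision alone but loses once recall is counted, which is the kind of reordering the abstract describes.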