Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Abstract

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth and are increasingly used to evaluate grounded generation. We show that they share a blind spot: they measure only precision (are the stated claims supported?) and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness, absent from open-domain faithfulness benchmarks, lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows that the low coverage is not an under-prompting artifact: explicitly asking models to be thorough does not close the gap. We combine faithfulness and coverage into a single score, validate the metric (controlled perturbation; agreement between a model-free regex extractor and a cross-family LLM extractor, with a system-level Spearman correlation of 1.0), and give a verifier-guided generation method that improves both precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
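
The abstention effect described above can be illustrated with a minimal sketch. It assumes that claims and oracle facts have already been normalized to comparable string identifiers, and it uses the harmonic mean (F1) as the combined score. The abstract reports F1 rankings but does not spell out the exact form of the paired score, so the combination here is an assumption. The function name `score_claims`, the `Scores` container, and all fact identifiers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float  # share of stated claims that the oracle supports
    recall: float     # share of oracle facts that the output covers
    f1: float         # harmonic mean of the two (assumed combination)


def score_claims(claimed: set[str], oracle: set[str]) -> Scores:
    """Score one output against a complete oracle fact set.

    Assumption: an output with no claims gets vacuous precision 1.0,
    which is exactly the behavior a precision-only metric rewards.
    """
    supported = claimed & oracle
    precision = len(supported) / len(claimed) if claimed else 1.0
    recall = len(supported) / len(oracle) if oracle else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Hypothetical complete oracle for one strategy decision.
oracle = {"tyre_wear_high", "undercut_threat", "safety_car_window", "rain_in_10_laps"}

# A terse output: one claim, and it is correct.
terse = {"tyre_wear_high"}

# A thorough output: three correct claims plus one unsupported claim.
thorough = {"tyre_wear_high", "undercut_threat", "safety_car_window", "fuel_critical"}

terse_scores = score_claims(terse, oracle)
thorough_scores = score_claims(thorough, oracle)

print(terse_scores)
print(thorough_scores)
```

In this toy case the terse output has perfect precision but covers only one of four oracle facts, while the thorough output trades some precision for much higher recall. A precision-only metric ranks the terse output first; a score that also requires coverage reverses the order. Recall is only computable here because the oracle set is assumed complete.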