Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with Complete Oracles

Abstract

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision (are the stated claims supported?) and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness, absent in open-domain faithfulness benchmarks, lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1 score, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows the low coverage is not an under-prompting artifact: explicitly asking models to be thorough does not close the gap. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, with a system-level Spearman correlation of 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
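
The blind spot can be illustrated with a minimal sketch. It is not the released metric: it assumes claims and oracle facts are already normalized to comparable identifiers, it treats any claim outside the oracle set as unsupported, and it combines precision and recall with a plain harmonic mean (F1 score). The function name `score_claims` and the fact labels are hypothetical, chosen only for illustration.

```python
from typing import Hashable, Iterable, NamedTuple


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_claims(claims: Iterable[Hashable], oracle_facts: Iterable[Hashable]) -> Scores:
    """Set-based precision, recall and F1 of stated claims against a complete fact set."""
    stated = set(claims)
    relevant = set(oracle_facts)
    supported = stated & relevant
    # Assumption of this sketch: an empty output scores 0.0 rather than being undefined.
    precision = len(supported) / len(stated) if stated else 0.0
    recall = len(supported) / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if supported else 0.0
    return Scores(precision, recall, f1)


# Hypothetical oracle for one decision: the four facts that mattered.
oracle = {"tyre_age_high", "gap_behind_small", "safety_car_out", "rival_pitted"}

# A terse answer states one correct fact and nothing else.
terse = score_claims({"tyre_age_high"}, oracle)

# A thorough answer covers three relevant facts but adds one unsupported claim.
thorough = score_claims(
    {"tyre_age_high", "gap_behind_small", "safety_car_out", "rain_expected"}, oracle
)

print(terse)
print(thorough)
```

Under these assumptions the terse answer wins on precision alone, while the thorough answer wins once recall is included, which is the reordering the abstract describes. Recall is only computable here because the oracle set is taken to be complete.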