Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle

Abstract

Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows the low coverage is not an under-prompting artifact: explicitly asking models to be thorough does not close the gap. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
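
The precision-versus-coverage argument above can be illustrated with a minimal sketch. It assumes claims and oracle facts have already been normalized to comparable identifiers, so that exact set membership stands in for the verification step, and it assumes the combined score is the standard F1 harmonic mean. The fact names and the `coverage_aware_score` helper are hypothetical and chosen only for illustration; they are not taken from the released benchmark or metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    precision: float
    recall: float
    f1: float


def coverage_aware_score(claims: set[str], oracle: set[str]) -> Score:
    """Score a response's claims against a complete set of relevant facts.

    Assumption: claims and oracle facts share one normalized vocabulary,
    so a claim is "supported" exactly when it appears in the oracle set.
    """
    supported = claims & oracle
    # Convention (an assumption): an empty response makes no unsupported
    # claims, so its precision is 1.0. This is the abstention loophole.
    precision = len(supported) / len(claims) if claims else 1.0
    recall = len(supported) / len(oracle) if oracle else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Score(precision, recall, f1)


# Hypothetical complete oracle for one pit-stop decision.
oracle = {"tyre_deg_high", "undercut_threat", "safety_car_window", "traffic_clear"}

# A terse response: one claim, and it is supported.
terse = {"tyre_deg_high"}

# A thorough response: four claims, one of them unsupported.
thorough = {"tyre_deg_high", "undercut_threat", "safety_car_window", "fuel_critical"}

terse_score = coverage_aware_score(terse, oracle)
thorough_score = coverage_aware_score(thorough, oracle)

print(terse_score)
print(thorough_score)
```

In this toy case the terse response has the higher precision but the lower F1, which mirrors the reordering described in the abstract: a precision-only metric would prefer the response that says almost nothing.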