Generating Benchmarks for Factuality Evaluation of Language Models
July 13, 2023
Authors: Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham
cs.AI
Abstract
Before deploying a language model (LM) within a given domain, it is important
to measure its tendency to generate factually incorrect information in that
domain. Existing factual generation evaluation methods focus on facts sampled
from the LM itself, and thus do not control the set of evaluated facts and
might under-represent rare and unlikely facts. We propose FACTOR: Factual
Assessment via Corpus TransfORmation, a scalable approach for evaluating LM
factuality. FACTOR automatically transforms a factual corpus of interest into a
benchmark evaluating an LM's propensity to generate true facts from the corpus
vs. similar but incorrect statements. We use our framework to create two
benchmarks: Wiki-FACTOR and News-FACTOR. We show that: (i) our benchmark scores
increase with model size and improve when the LM is augmented with retrieval;
(ii) benchmark score correlates with perplexity, but the two metrics do not
always agree on model ranking; and (iii) when perplexity and benchmark score
disagree, the latter better reflects factuality in open-ended generation, as
measured by human annotators. We make our data and code publicly available at
https://github.com/AI21Labs/factor.
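The evaluation the abstract describes — scoring an LM on whether it prefers the true statement from the corpus over similar but incorrect variants — can be sketched as follows. This is an illustrative, assumed scoring rule, not the authors' exact implementation: `factor_score`, `log_likelihood`, and the length normalization are hypothetical names and choices introduced here for clarity.

```python
def factor_score(examples, log_likelihood):
    """Hedged sketch of a FACTOR-style metric: an example counts as
    correct when the LM assigns the true completion a higher
    (length-normalized) log-likelihood than every false variant."""
    correct = 0
    for prefix, true_completion, false_completions in examples:
        def norm_ll(completion):
            # Normalize by word count so longer completions are not penalized.
            return log_likelihood(prefix, completion) / max(len(completion.split()), 1)
        if all(norm_ll(true_completion) > norm_ll(f) for f in false_completions):
            correct += 1
    return correct / len(examples)

# Toy stand-in for an LM: hardcoded pseudo log-probabilities per completion.
_scores = {
    "France, in Western Europe.": -2.0,
    "Spain, in Western Europe.": -4.0,
    "Earth every 27 days.": -5.0,
    "Mars every 27 days.": -1.0,
}

def toy_log_likelihood(prefix, completion):
    return _scores[completion]

examples = [
    ("Paris is the capital of", "France, in Western Europe.",
     ["Spain, in Western Europe."]),            # toy model prefers the true fact
    ("The Moon orbits", "Earth every 27 days.",
     ["Mars every 27 days."]),                  # toy model prefers the false variant
]
print(factor_score(examples, toy_log_likelihood))  # → 0.5
```

In practice `log_likelihood` would come from the evaluated LM (summed token log-probabilities of the completion conditioned on the prefix), and the false completions from the corpus-transformation step that perturbs a fact in the original sentence.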