
Generating Benchmarks for Factuality Evaluation of Language Models

July 13, 2023
Authors: Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, Yoav Shoham
cs.AI

Abstract

Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing factual generation evaluation methods focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent rare and unlikely facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create two benchmarks: Wiki-FACTOR and News-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score correlates with perplexity, but the two metrics do not always agree on model ranking; and (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators. We make our data and code publicly available at https://github.com/AI21Labs/factor.
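The abstract does not spell out the scoring rule, but the described setup — contrasting a true statement from the corpus with similar-but-incorrect variants — is naturally scored by checking whether the model assigns the true completion a higher likelihood than every false alternative. The sketch below illustrates this idea only; the function names (`factor_accuracy`, `make_unigram_loglik`), the toy unigram scorer standing in for a real LM, and the example data are all assumptions for illustration, not the paper's actual implementation.

```python
import math
from collections import Counter


def make_unigram_loglik(corpus_text):
    """Toy stand-in for an LM scorer: mean per-token log-probability
    under an add-one-smoothed unigram model of the corpus.
    (A real evaluation would use an actual LM's conditional log-likelihood.)"""
    counts = Counter(corpus_text.lower().split())
    total = sum(counts.values())
    vocab = len(counts)

    def loglik(prefix, completion):
        tokens = completion.lower().split()
        return sum(
            math.log((counts[t] + 1) / (total + vocab + 1)) for t in tokens
        ) / len(tokens)

    return loglik


def factor_accuracy(examples, loglik):
    """FACTOR-style score (assumed formulation): fraction of examples where
    the true completion outscores every incorrect variant."""
    correct = 0
    for prefix, true_completion, false_completions in examples:
        true_score = loglik(prefix, true_completion)
        if all(true_score > loglik(prefix, f) for f in false_completions):
            correct += 1
    return correct / len(examples)


# Hypothetical mini-corpus and benchmark example.
corpus = "Paris is the capital of France . Paris hosts the Louvre ."
loglik = make_unigram_loglik(corpus)
examples = [
    ("The capital of France is", "Paris", ["Rome", "Berlin"]),
]
acc = factor_accuracy(examples, loglik)
print(acc)  # the true completion outscores both false variants here
```

The key property this captures is the one the abstract emphasizes: because the benchmark is built from a chosen corpus rather than from the model's own samples, the evaluated fact set is controlled and can include rare facts the model would be unlikely to generate on its own.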