

CoverBench: A Challenging Benchmark for Complex Claim Verification

August 6, 2024
Authors: Alon Jacovi, Moran Ambar, Eyal Ben-David, Uri Shaham, Amir Feder, Mor Geva, Dror Marcus, Avi Caciularu
cs.AI

Abstract

There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) targeting specific use-cases (e.g., financial tables), requiring transformations, negative sampling, and selection of hard examples to collect such a benchmark. CoverBench provides a diversified evaluation for complex claim verification in a variety of domains, types of reasoning, relatively long inputs, and a variety of standardizations, such as multiple representations for tables where available, and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show CoverBench is challenging and has very significant headroom. The data is available at https://huggingface.co/datasets/google/coverbench.
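As a minimal sketch (not from the paper), the benchmark can be loaded with the Hugging Face `datasets` library and each example rendered as an entailment-style verification prompt. The column names used below (`context`, `claim`) and the split name are assumptions about the schema, not confirmed by this page; inspect `dataset.features` before relying on them.

```python
# Sketch: formatting CoverBench-style examples for claim verification.
# Field names ("context", "claim") and the split name are assumptions --
# check the dataset card and dataset.features for the actual schema.

def format_verification_prompt(context: str, claim: str) -> str:
    """Render a (context, claim) pair as a simple entailment-style prompt."""
    return (
        "Evidence:\n" + context.strip() + "\n\n"
        "Claim: " + claim.strip() + "\n"
        "Is the claim supported by the evidence? Answer TRUE or FALSE."
    )

if __name__ == "__main__":
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset("google/coverbench", split="test")  # split name assumed
    example = ds[0]
    print(format_verification_prompt(example["context"], example["claim"]))
```

Keeping the prompt construction in a pure helper function makes it easy to swap in other verification formats (e.g., multiple table representations) without touching the data-loading code.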

