
CoverBench: A Challenging Benchmark for Complex Claim Verification

August 6, 2024
作者: Alon Jacovi, Moran Ambar, Eyal Ben-David, Uri Shaham, Amir Feder, Mor Geva, Dror Marcus, Avi Caciularu
cs.AI

Abstract

There is a growing line of research on verifying the correctness of language models' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) targeting specific use-cases (e.g., financial tables), requiring transformations, negative sampling, and selection of hard examples to collect such a benchmark. CoverBench provides a diversified evaluation for complex claim verification in a variety of domains, types of reasoning, relatively long inputs, and a variety of standardizations, such as multiple representations for tables where available, and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show CoverBench is challenging and has very significant headroom. The data is available at https://huggingface.co/datasets/google/coverbench.
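
As a minimal sketch of how one might access the released data with the Hugging Face `datasets` library (the paper only gives the dataset URL; the split and field names inspected below are assumptions for illustration, not documented here):

```python
# Minimal sketch: loading CoverBench from the Hugging Face Hub.
# Assumes the `datasets` library is installed (pip install datasets).
from datasets import load_dataset

# Load the dataset by its Hub identifier from the paper's URL.
dataset = load_dataset("google/coverbench")

# Inspect the available splits and the fields of one example record.
print(dataset)
for split_name in dataset:
    example = dataset[split_name][0]
    print(split_name, list(example.keys()))
    break
```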