CoverBench: A Challenging Benchmark for Complex Claim Verification

Abstract

Research on verifying the correctness of language model (LM) outputs is growing. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) and target specific use-cases (e.g., financial tables), so building such a benchmark requires transformations, negative sampling, and selection of hard examples. CoverBench provides a diversified evaluation for complex claim verification, covering a variety of domains and types of reasoning, relatively long inputs, and a variety of standardizations, such as multiple representations for tables where available and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show that CoverBench is challenging and leaves very significant headroom. The data is available at https://huggingface.co/datasets/google/coverbench .
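As a rough illustration of the task setup described in the abstract, the sketch below scores a binary claim verifier on (context, claim, label) records. The record fields, the `Example` class, the toy data, and the `always_supported` baseline are all hypothetical and are not taken from the released dataset. Macro-F1 is used here as one common choice for binary verification, not necessarily the metric the authors report.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    # Hypothetical field names; the released schema may differ.
    context: str  # source document or serialized table
    claim: str    # statement to verify against the context
    label: bool   # True if the context supports the claim


def macro_f1(gold: List[bool], pred: List[bool]) -> float:
    """Average the per-class F1 of the 'supported' and 'not supported' classes."""
    scores = []
    for cls in (True, False):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def evaluate(examples: List[Example],
             verifier: Callable[[str, str], bool]) -> float:
    """Run the verifier on every example and return its macro-F1."""
    gold = [ex.label for ex in examples]
    pred = [verifier(ex.context, ex.claim) for ex in examples]
    return macro_f1(gold, pred)


def always_supported(context: str, claim: str) -> bool:
    """Trivial baseline that accepts every claim (illustration only)."""
    return True


# Toy records invented for this sketch, not drawn from CoverBench.
table = "Revenue: 2020 = 10, 2021 = 15"
examples = [
    Example(table, "Revenue grew by 5 from 2020 to 2021.", True),
    Example(table, "Revenue fell in 2021.", False),
]

score = evaluate(examples, always_supported)
```

A real baseline would replace `always_supported` with a call to an LM or an NLI model. The trivial baseline only shows that accepting every claim is penalized once the two classes are averaged.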
