The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

December 11, 2025
作者: Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong, Chris Alberti, Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, Yonatan Bitton, Adam Bloniarz, Aijun Bai, Andrew Wang, Anfal Siddiqui, Arturo Bajuelos Castillo, Aviel Atias, Chang Liu, Corey Fry, Daniel Balle, Deepanway Ghosal, Doron Kukliansky, Dror Marcus, Elena Gribovskaya, Eran Ofek, Honglei Zhuang, Itay Laish, Jan Ackermann, Lily Wang, Meg Risdal, Megan Barnes, Michael Fink, Mohamed Amin, Moran Ambar, Natan Potikha, Nikita Gupta, Nitzan Katz, Noam Velan, Ofir Roval, Ori Ram, Polina Zablotskaia, Prathamesh Bang, Priyanka Agrawal, Rakesh Ghiya, Sanjay Ganapathy, Simon Baumgartner, Sofia Erell, Sushant Prakash, Thibault Sellam, Vikram Rao, Xuanhui Wang, Yaroslav Akulov, Yulong Yang, Zhen Yang, Zhixin Lai, Zhongru Wu, Anca Dragan, Avinatan Hassidim, Fernando Pereira, Slav Petrov, Srinivasan Venkatachary, Tulsee Doshi, Yossi Matias, Sasha Goldshtein, Dipanjan Das
cs.AI

Abstract

We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at https://www.kaggle.com/benchmarks/google/facts .
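The abstract states that the final suite score is the plain average of the four sub-leaderboard scores. The following minimal sketch illustrates that aggregation; the sub-leaderboard key names and the score values are hypothetical placeholders, not part of the released benchmark.

```python
# Sketch of the suite-level aggregation described in the abstract: the
# final FACTS score is the unweighted mean of the four sub-leaderboard
# scores. Key names and example values below are illustrative only.

def facts_suite_score(sub_scores: dict[str, float]) -> float:
    """Average the four sub-leaderboard scores into one suite score."""
    expected = {"multimodal", "parametric", "search", "grounding_v2"}
    if set(sub_scores) != expected:
        raise ValueError(f"expected scores for {sorted(expected)}")
    return sum(sub_scores.values()) / len(sub_scores)

example = {
    "multimodal": 0.71,   # hypothetical FACTS Multimodal score
    "parametric": 0.64,   # hypothetical FACTS Parametric score
    "search": 0.80,       # hypothetical FACTS Search score
    "grounding_v2": 0.88, # hypothetical FACTS Grounding (v2) score
}
print(round(facts_suite_score(example), 4))  # → 0.7575
```

An unweighted mean keeps the four scenarios equally influential, which matches the paper's stated goal of a "robust and balanced assessment" rather than privileging any single factuality setting.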