

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

September 9, 2025
Authors: Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das
cs.AI

Abstract

We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.