

SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

September 9, 2025
Authors: Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das
cs.AI

Abstract

We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
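
The reported F1 follows SimpleQA's convention: an autorater grades each response as correct, incorrect, or not attempted, and F1 is the harmonic mean of overall accuracy (correct over all questions) and accuracy on attempted questions, so a model is rewarded for abstaining rather than guessing wrongly. Below is a minimal sketch of that metric, assuming SimpleQA Verified inherits this grading scheme from the original SimpleQA; the function name `simpleqa_f1` and the grade labels are illustrative, not the benchmark's actual API.

```python
# Minimal sketch of the SimpleQA-style F1 metric (illustrative, not the
# official evaluation code). Assumes each response has already been graded
# by the autorater as "correct", "incorrect", or "not_attempted".

from collections import Counter


def simpleqa_f1(grades: list[str]) -> float:
    """Harmonic mean of overall accuracy and accuracy on attempted questions."""
    counts = Counter(grades)
    total = len(grades)
    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    overall = correct / total if total else 0.0            # correct / all questions
    given_attempted = correct / attempted if attempted else 0.0  # correct / attempted

    if overall + given_attempted == 0:
        return 0.0
    return 2 * overall * given_attempted / (overall + given_attempted)


# Hypothetical example: 550 correct, 350 incorrect, 100 abstentions on 1,000 prompts
grades = ["correct"] * 550 + ["incorrect"] * 350 + ["not_attempted"] * 100
print(f"F1 = {100 * simpleqa_f1(grades):.1f}")  # F1 = 57.9
```

Note how the two terms pull in opposite directions: answering every question maximizes the attempted-accuracy denominator's coverage but penalizes wrong guesses in overall accuracy, while excessive abstention caps overall accuracy even if attempted answers are all correct.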