
GPQA: A Graduate-Level Google-Proof Q&A Benchmark

November 20, 2023
Authors: David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman
cs.AI

Abstract

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
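As a rough illustration of the multiple-choice evaluation reported above, the sketch below computes accuracy over GPQA-style records. This is a minimal sketch under stated assumptions: the record format and the `model_choice` placeholder are hypothetical stand-ins, not the paper's actual evaluation harness. A real run would replace `model_choice` with a call to the model under test (e.g., a GPT-4-based baseline).

```python
# Minimal sketch of GPQA-style multiple-choice scoring.
# Assumed (hypothetical) record format: {"question": str,
# "options": list of 4 strings, "answer": int index of the correct option}.
import random

def model_choice(question: str, options: list[str]) -> int:
    """Placeholder model: returns a random option index (~25% accuracy,
    i.e., chance on 4-way questions). Replace with a real LLM call."""
    return random.randrange(len(options))

def accuracy(dataset: list[dict]) -> float:
    """Fraction of records where the model picks the correct option."""
    correct = sum(
        model_choice(r["question"], r["options"]) == r["answer"]
        for r in dataset
    )
    return correct / len(dataset)

if __name__ == "__main__":
    # Toy example record, for illustration only (not drawn from GPQA).
    demo = [
        {
            "question": "Which particle mediates the electromagnetic force?",
            "options": ["gluon", "photon", "W boson", "graviton"],
            "answer": 1,
        },
    ]
    print(f"accuracy: {accuracy(demo):.2%}")
```

Against this framing, the abstract's headline numbers are directly comparable: chance is 25%, the strongest GPT-4-based baseline reaches 39%, skilled non-experts 34%, and domain experts 65%.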