GPQA: A Graduate-Level Google-Proof Q&A Benchmark
November 20, 2023
Authors: David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman
cs.AI
Abstract
We present GPQA, a challenging dataset of 448 multiple-choice questions
written by domain experts in biology, physics, and chemistry. We ensure that
the questions are high-quality and extremely difficult: experts who have or are
pursuing PhDs in the corresponding domains reach 65% accuracy (74% when
discounting clear mistakes the experts identified in retrospect), while highly
skilled non-expert validators only reach 34% accuracy, despite spending on
average over 30 minutes with unrestricted access to the web (i.e., the
questions are "Google-proof"). The questions are also difficult for
state-of-the-art AI systems, with our strongest GPT-4-based baseline achieving
39% accuracy. If we are to use future AI systems to help us answer very hard
questions, for example, when developing new scientific knowledge, we need to
develop scalable oversight methods that enable humans to supervise their
outputs, which may be difficult even if the supervisors are themselves skilled
and knowledgeable. The difficulty of GPQA both for skilled non-experts and
frontier AI systems should enable realistic scalable oversight experiments,
which we hope can help devise ways for human experts to reliably get truthful
information from AI systems that surpass human capabilities.
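
As an illustration of the evaluation setup the abstract describes (four-option multiple-choice questions scored by accuracy), here is a minimal sketch. The field names and the random-guess placeholder standing in for the GPT-4 baseline are hypothetical; this is not the authors' evaluation code.

```python
import random

# Minimal sketch of multiple-choice accuracy scoring in the style GPQA uses
# (four options, one correct). Field names and the example item below are
# illustrative only.

questions = [
    {
        "question": "Which quantum number determines the shape of an orbital?",
        "options": ["n", "l", "m_l", "m_s"],
        "answer_index": 1,  # ground-truth index into options
    },
    # ... the full GPQA main set contains 448 such items
]

def guess_model(question: str, options: list[str]) -> int:
    """Placeholder predictor: uniform random choice. A real baseline
    (e.g., a GPT-4 prompt with chain-of-thought, as in the paper)
    would return its predicted option index here."""
    return random.randrange(len(options))

correct = sum(
    guess_model(q["question"], q["options"]) == q["answer_index"]
    for q in questions
)
accuracy = correct / len(questions)
print(f"Accuracy: {accuracy:.1%}")  # random guessing is about 25% on 4 options
```

Against this kind of scoring, the abstract's headline numbers are: 65% for domain experts (74% after discounting clear mistakes), 34% for skilled non-expert validators with web access, and 39% for the strongest GPT-4-based baseline.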