GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Abstract

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.