SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

February 20, 2025
Authors: M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming Xu, Zhenzhu Yang, Zekun Moore Wang, Junting Zhou, Yuelin Bai, Xingyuan Bu, Chenglin Cai, Liang Chen, Yifan Chen, Chengtuo Cheng, Tianhao Cheng, Keyi Ding, Siming Huang, Yun Huang, Yaoru Li, Yizhe Li, Zhaoqun Li, Tianhao Liang, Chengdong Lin, Hongquan Lin, Yinghao Ma, Zhongyuan Peng, Zifan Peng, Qige Qi, Shi Qiu, Xingwei Qu, Yizhou Tan, Zili Wang, Chenqing Wang, Hao Wang, Yiya Wang, Yubo Wang, Jiajun Xu, Kexin Yang, Ruibin Yuan, Yuanhao Yue, Tianyang Zhan, Chun Zhang, Jingyang Zhang, Xiyue Zhang, Xingjian Zhang, Yue Zhang, Yongchi Zhao, Xiangyu Zheng, Chenghua Zhong, Yang Gao, Zhoujun Li, Dayiheng Liu, Qian Liu, Tianyu Liu, Shiwen Ni, Junran Peng, Yujia Qin, Wenbo Su, Guoyin Wang, Shi Wang, Jian Yang, Min Yang, Meng Cao, Xiang Yue, Zhaoxiang Zhang, Wangchunshu Zhou, Jiaheng Liu, Qunshu Lin, Wenhao Huang, Ge Zhang
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
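
The Human-LLM collaborative filtering described in the abstract amounts to an iterative screening loop: candidate questions are posed to a panel of LLMs, items that every model answers correctly are dropped as trivial, items with no answer consensus are routed to human experts as potentially ambiguous, and the pass is repeated on the refined pool. The Python sketch below shows one way such a pass could be organized; the names (Question, query_llms, expert_review, filter_round) and the trivial/ambiguous heuristics are illustrative assumptions, not the authors' published implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    options: list[str]
    answer: str                  # gold option label, e.g. "C"
    status: str = "pending"      # pending | kept | revised | discarded

def query_llms(q: Question, models: list[str]) -> list[str]:
    # Simulated screening panel for demonstration only; a real pipeline
    # would call an LLM inference API here (an assumption, not the
    # paper's actual setup).
    labels = [chr(ord("A") + i) for i in range(len(q.options))]
    return [random.choice(labels) for _ in models]

def expert_review(q: Question) -> Question:
    # Placeholder for routing a flagged question to a human annotator,
    # e.g. through an annotation UI; here we simply mark it as revised.
    q.status = "revised"
    return q

def filter_round(pool: list[Question], models: list[str]) -> list[Question]:
    """One screening pass: drop trivial items, flag ambiguous ones for experts."""
    kept = []
    for q in pool:
        preds = query_llms(q, models)
        accuracy = sum(p == q.answer for p in preds) / len(preds)
        if accuracy == 1.0:
            q.status = "discarded"         # trivial: every model solves it
        elif len(set(preds)) == len(preds):
            kept.append(expert_review(q))  # no consensus: possibly ambiguous
        else:
            q.status = "kept"
            kept.append(q)
    return kept

if __name__ == "__main__":
    pool = [Question("What is ...?", ["A) w", "B) x", "C) y", "D) z"], "C")
            for _ in range(5)]
    survivors = filter_round(pool, ["model-a", "model-b", "model-c"])
    print([q.status for q in survivors])
```

In practice such a loop would be re-run after each round of expert revision until the question pool stabilizes.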

