ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems
October 13, 2025
Authors: Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou
cs.AI
Abstract
In recent years, the research focus of large language models (LLMs) and
agents has shifted increasingly from demonstrating novel capabilities to
complex reasoning and tackling challenging tasks. However, existing evaluations
focus mainly on math/code contests or general tasks, while existing
multi-domain academic benchmarks lack sufficient reasoning depth, leaving the
field without a rigorous benchmark for high-level reasoning. To fill this gap,
we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs
and agents to acquire and reason over academic knowledge. It consists of 50
expert-annotated academic problems across five high-reasoning domains:
computer science, economics, law, mathematics, and philosophy. All
questions are sourced from top-tier publications in recent years and undergo
rigorous annotation and quality control to ensure they are both challenging and
answerable. We conduct systematic evaluations of over 10 mainstream LLMs and
agents. The results show that most LLMs score below 20 points, with even the
cutting-edge GPT-5 achieving only 16 points. While agents achieve higher
scores, none exceeds 40 points. This reveals the gap between the current
capabilities of LLMs and agents and the demands of super-intelligent academic
research tasks, and highlights how challenging Acadreason is.