
ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

October 13, 2025
作者: Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou
cs.AI

Abstract

In recent years, research on large language models (LLMs) and agents has increasingly shifted from demonstrating novel capabilities to performing complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, and current multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains: computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications of recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We systematically evaluate more than 10 mainstream LLMs and agents. The results show that most LLMs score below 20 points, with even the cutting-edge GPT-5 reaching only 16 points; agents score higher, but none exceeds 40 points. This demonstrates the capability gap of current LLMs and agents on super-intelligent academic research tasks and highlights the challenge posed by Acadreason.
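
The abstract reports results on a 0-100-point scale. As a rough, hypothetical illustration of the kind of evaluation loop described above, here is a minimal Python sketch; the file name acadreason.jsonl, the JSONL field names, the exact-match grading rule, and the model_answer stub are all assumptions, since the abstract does not specify the released data format or the grading protocol.

```python
import json


def load_benchmark(path: str) -> list[dict]:
    """Load benchmark problems from a JSONL file.

    Assumes one JSON object per line with "question" and
    "reference_answer" fields -- a hypothetical schema, not the
    paper's actual release format.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


def model_answer(question: str) -> str:
    """Placeholder: replace with a call to the LLM or agent under test."""
    return ""


def score(predictions: list[str], problems: list[dict]) -> float:
    """Return a 0-100 score: the percentage of exact-match answers."""
    correct = sum(
        pred.strip() == prob["reference_answer"].strip()
        for pred, prob in zip(predictions, problems)
    )
    return 100.0 * correct / len(problems)


if __name__ == "__main__":
    problems = load_benchmark("acadreason.jsonl")  # hypothetical filename
    predictions = [model_answer(p["question"]) for p in problems]
    print(f"Score: {score(predictions, problems):.1f} / 100")
```

With 50 problems, exact-match grading of this kind would yield scores in 2-point increments; the paper's actual grading protocol may differ.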