

ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

October 13, 2025
Authors: Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou
cs.AI

Abstract

In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations concentrate mainly on math/code contests or general tasks, while current multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains: computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications of recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of more than 10 mainstream LLMs and agents. The results show that most LLMs score below 20 points, with even the cutting-edge GPT-5 achieving only 16. While agents reach higher scores, none exceeds 40 points. This reveals the current capability gap of LLMs and agents on super-intelligent academic research tasks and highlights the challenge posed by Acadreason.
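The abstract does not describe the benchmark's release format or grading pipeline, but a minimal harness sketch conveys how a per-domain, 0–100-scored benchmark like this is typically consumed. The file name `acadreason.jsonl`, the item fields (`domain`, `question`, `reference_answer`), and the exact-match grader below are illustrative assumptions, not the paper's actual method:

```python
import json
from collections import defaultdict
from typing import Callable, Dict, List


def score_answer(prediction: str, reference: str) -> float:
    # Toy grader on a 0-100 scale: exact string match. The paper's actual
    # grading (expert rubric, partial credit, etc.) is not specified here.
    return 100.0 if prediction.strip() == reference.strip() else 0.0


def evaluate(model: Callable[[str], str],
             path: str = "acadreason.jsonl") -> Dict[str, float]:
    """Run `model` over a JSONL file of benchmark items and report
    per-domain means plus a macro-averaged overall score."""
    per_domain: Dict[str, List[float]] = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)  # assumed fields: domain, question, reference_answer
            pred = model(item["question"])
            per_domain[item["domain"]].append(
                score_answer(pred, item["reference_answer"])
            )
    results = {d: sum(s) / len(s) for d, s in per_domain.items()}
    # Macro average across the five domains, computed before "overall" is added.
    results["overall"] = sum(results.values()) / len(results)
    return results


# Usage sketch: plug in any LLM or agent wrapper, e.g.
#   scores = evaluate(lambda q: my_llm.generate(q))
#   print(scores)  # {'mathematics': ..., 'law': ..., 'overall': ...}
```

Under this reading, a headline score like GPT-5's 16 points would correspond to the macro-averaged `overall` value; the actual aggregation used by the authors may differ.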