SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?
July 7, 2025
Authors: Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, Siheng Chen
cs.AI
Abstract
The rapid advancement of AI agents has ignited the long-held ambition of
leveraging them to accelerate scientific discovery. Achieving this goal
requires a deep understanding of the frontiers of human knowledge. As such,
Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for
evaluating scientific AI agents. In this work, we aim to construct the
foundational architecture for general-purpose agents and validate its
capabilities through leading performance on HLE. To achieve this, we introduce
X-Master, a tool-augmented reasoning agent designed to emulate human
researchers by interacting flexibly with external tools during its reasoning
process. Guided by the idea of treating code as an interaction language, the
agent can flexibly leverage built-in Python libraries and our customized tools
to augment its reasoning. We further scale its capabilities through
X-Masters, a scattered-and-stacked agentic workflow that systematically
enhances breadth and depth of reasoning. Our open-source solution, X-Masters,
sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing
OpenAI's and Google's Deep Research (26.6% and 26.9%, respectively) and becoming the first to
exceed the 30% threshold. This work deepens our understanding of complex
task-solving and yields experience that can inform future advancements and
guide subsequent model training.
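
To make the notion of code as an interaction language concrete, the following is a minimal sketch of one way such a loop could look, not the X-Master implementation. Everything in it is an assumption made for illustration: the `<code>` and `<result>` tags, the helper names `run_code` and `agent_loop`, and the scripted `toy_model` that stands in for a language model. The sketch alternates between model output and code execution, feeding each execution result back into the context until the model replies without code.

```python
import contextlib
import io
import re

# Hypothetical tag format; the abstract does not specify how code is delimited.
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_code(source: str) -> str:
    """Execute a snippet and return what it printed, or the error it raised."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # no sandboxing here; a real system would need isolation
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def agent_loop(generate, question: str, max_turns: int = 4) -> str:
    """Alternate between model replies and code execution until no code is emitted."""
    context = question
    reply = ""
    for _ in range(max_turns):
        reply = generate(context)
        context += "\n" + reply
        match = CODE_BLOCK.search(reply)
        if match is None:
            return reply  # no tool call: treat the reply as the final answer
        # Append the execution result so the next reply can build on it.
        context += "\n<result>" + run_code(match.group(1)) + "</result>"
    return reply


def toy_model(context: str) -> str:
    """Scripted stand-in for a language model (assumption, for this example only)."""
    if "<result>" not in context:
        return "I will compute it. <code>import math\nprint(math.factorial(10))</code>"
    result = re.search(r"<result>(.*?)</result>", context, re.DOTALL).group(1)
    return f"The answer is {result}."


answer = agent_loop(toy_model, "What is 10 factorial?")
print(answer)
```

In this toy run the first reply requests a computation, the loop executes it and appends the printed value, and the second reply uses that value as its answer. Replacing `toy_model` with a call to an actual model, and `run_code` with a sandboxed executor, would be the natural next steps under the same assumptions.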