SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?

July 7, 2025
Authors: Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Yuzhi Zhang, Linfeng Zhang, Siheng Chen
cs.AI

Abstract

The rapid advancement of AI agents has ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity's Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate its capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. Guided by the conceptualization of code as an interaction language, this agent can flexibly leverage built-in Python libraries and our customized tools to augment its reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances the breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI's and Google's Deep Research (26.6% and 26.9%, respectively) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements and guide subsequent model training.
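
To make the idea of code as an interaction language more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical protocol in which the model wraps Python in `<code>` tags and receives the printed output back in `<result>` tags; the tag names, `stub_model`, `run_code`, and `solve` are all illustrative inventions, and a real agent would call an actual language model and run code in a sandbox.

```python
import contextlib
import io
import re

# Hypothetical tag format for code emitted by the model (an assumption).
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_code(source: str) -> str:
    """Execute a code string and return what it printed, or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})  # toy only: no sandboxing is applied here
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def stub_model(context: str) -> str:
    """Stand-in for a reasoning model: requests a computation, then answers."""
    if "<result>" not in context:
        return "I need 2**10. <code>print(2**10)</code>"
    value = re.findall(r"<result>(.*?)</result>", context, re.DOTALL)[-1]
    return f"Final answer: {value}"


def solve(question: str, model=stub_model, max_turns: int = 4) -> str:
    """Alternate between model output and code execution until no code is emitted."""
    context = question
    reply = ""
    for _ in range(max_turns):
        reply = model(context)
        context += "\n" + reply
        match = CODE_RE.search(reply)
        if match is None:
            return reply  # no tool call requested, so treat this as the answer
        # Feed the execution result back into the reasoning context.
        context += f"\n<result>{run_code(match.group(1))}</result>"
    return reply
```

Calling `solve("What is 2**10?")` runs one round of code execution and then returns the stub's final answer. The point of the sketch is only the loop shape: reasoning text and tool output share one growing context, so any Python library can serve as a tool without a separate tool-calling schema.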