Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning
January 12, 2026
Authors: Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, Dongzhan Zhou
cs.AI
Abstract
The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open-ended scientific world. Existing LLM-based agents rely on static, pre-defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem-driven artifacts, TTE overcomes the rigidity and long-tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state-of-the-art performance in both accuracy and tool efficiency, while enabling effective cross-domain adaptation of computational tools. The code and benchmark have been released at https://github.com/lujiaxuan0520/Test-Time-Tool-Evol.
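To make the synthesize, verify, and evolve cycle concrete, the following is a minimal sketch of how an agent might grow its tool library at inference time. It is not the authors' implementation: `synthesize`, `verify`, `ToolLibrary`, and the `CANDIDATES` table are hypothetical names, and the table of hardcoded drafts merely stands in for an LLM code generator. The sketch assumes each task arrives with a few reference checks against which a candidate tool can be validated.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for an LLM code generator: maps a task name to
# successive candidate drafts. A real system would prompt a model here.
CANDIDATES: Dict[str, List[str]] = {
    "kinetic_energy": [
        "def kinetic_energy(m, v):\n    return m * v ** 2\n",        # buggy draft
        "def kinetic_energy(m, v):\n    return 0.5 * m * v ** 2\n",  # corrected draft
    ],
}

Check = Tuple[tuple, float]  # (arguments, expected result)


def synthesize(task: str, attempt: int) -> Optional[str]:
    """Return the next candidate source for a task, or None if exhausted."""
    drafts = CANDIDATES.get(task, [])
    return drafts[attempt] if attempt < len(drafts) else None


def verify(source: str, task: str, checks: List[Check]) -> Optional[Callable]:
    """Execute the candidate and accept it only if every check passes."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: trusted sandbox; illustration only
        tool = namespace[task]
        for args, expected in checks:
            if abs(tool(*args) - expected) > 1e-9:
                return None
    except Exception:
        return None
    return tool


class ToolLibrary:
    """A library that starts empty and gains tools as problems demand them."""

    def __init__(self) -> None:
        self.tools: Dict[str, Callable] = {}

    def solve(self, task: str, args: tuple, checks: List[Check],
              max_attempts: int = 3) -> float:
        if task not in self.tools:
            # No existing tool: synthesize drafts until one passes verification.
            for attempt in range(max_attempts):
                source = synthesize(task, attempt)
                if source is None:
                    break
                tool = verify(source, task, checks)
                if tool is not None:
                    self.tools[task] = tool  # register for later reuse
                    break
        if task not in self.tools:
            raise LookupError(f"no verified tool for task {task!r}")
        return self.tools[task](*args)


library = ToolLibrary()
checks = [((2.0, 3.0), 9.0)]  # 0.5 * 2 * 3**2 = 9
result = library.solve("kinetic_energy", (4.0, 5.0), checks)
print(result)
```

In this sketch the first draft fails the reference check and is discarded, the second passes and is registered, and any later call for the same task reuses the stored tool instead of synthesizing again.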