Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning
January 12, 2026
Authors: Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, Dongzhan Zhou
cs.AI
Abstract
The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open-ended scientific world. Existing LLM-based agents rely on static, pre-defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem-driven artifacts, TTE overcomes the rigidity and long-tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state-of-the-art performance in both accuracy and tool efficiency, while enabling effective cross-domain adaptation of computational tools. The code and benchmark have been released at https://github.com/lujiaxuan0520/Test-Time-Tool-Evol.
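The synthesize, verify, and evolve cycle described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names (`synthesize`, `verify`, `get_or_evolve`), the dictionary-based tool library, and the hard-coded candidate source standing in for an LLM code generator are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an LLM code generator: tool name -> candidate source.
CANDIDATE_SOURCES: Dict[str, str] = {
    "kinetic_energy": "def kinetic_energy(m, v):\n    return 0.5 * m * v ** 2\n",
}

Check = Tuple[tuple, float]  # (input arguments, expected output)


def synthesize(name: str) -> str:
    """Return candidate source code for a tool (a real agent would query an LLM)."""
    return CANDIDATE_SOURCES[name]


def verify(tool: Callable[..., float], checks: List[Check]) -> bool:
    """Accept a tool only if it reproduces every known input/output pair."""
    try:
        return all(abs(tool(*args) - expected) < 1e-9 for args, expected in checks)
    except Exception:
        return False


def get_or_evolve(name: str, library: Dict[str, Callable], checks: List[Check]) -> Callable:
    """Reuse a library tool if present; otherwise synthesize, verify, and register it."""
    if name in library:
        return library[name]
    namespace: dict = {}
    # Assumption: executing generated code directly. This is unsandboxed and
    # suitable only for illustration.
    exec(synthesize(name), namespace)
    tool = namespace[name]
    if not verify(tool, checks):
        raise ValueError(f"candidate tool {name!r} failed verification")
    library[name] = tool  # the library grows at inference time
    return tool


library: Dict[str, Callable] = {}
ke = get_or_evolve("kinetic_energy", library, [((2.0, 3.0), 9.0)])
result = ke(4.0, 5.0)
```

In this sketch the library starts empty and gains a tool only after the candidate passes its checks, so a second request for the same tool is served from the library rather than synthesized again.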