SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
February 13, 2026
Authors: Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
cs.AI
Abstract
Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
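The abstract describes SciForge only at a high level: the tool action space is modeled as a dependency graph, and training trajectories are generated from it. The sketch below is one possible reading of that idea, not the authors' implementation. It samples a tool-call order in which every tool runs after the tools it depends on. The tool names, the `DEPENDS_ON` table, and the functions `sample_trajectory` and `is_valid` are all hypothetical and chosen purely for illustration.

```python
import random

# Hypothetical tool dependency graph: each tool maps to the tools whose
# outputs it consumes. Names are illustrative, not taken from SciAgentGym.
DEPENDS_ON = {
    "load_structure": [],
    "compute_mass": ["load_structure"],
    "compute_charge": ["load_structure"],
    "estimate_energy": ["compute_mass", "compute_charge"],
}


def sample_trajectory(depends_on, target, rng):
    """Return a tool order ending at `target` in which every tool appears
    after all of its prerequisites (a randomized topological order of the
    target's ancestor subgraph)."""
    # Collect the target and everything it transitively depends on.
    needed, stack = set(), [target]
    while stack:
        tool = stack.pop()
        if tool not in needed:
            needed.add(tool)
            stack.extend(depends_on[tool])

    order, done = [], set()
    while len(order) < len(needed):
        # Tools whose prerequisites have all been executed already.
        ready = sorted(
            t for t in needed - done
            if all(d in done for d in depends_on[t])
        )
        if not ready:
            raise ValueError("dependency cycle detected")
        choice = rng.choice(ready)
        order.append(choice)
        done.add(choice)
    return order


def is_valid(depends_on, order):
    """Check that no tool runs before one of its prerequisites."""
    seen = set()
    for tool in order:
        if any(d not in seen for d in depends_on[tool]):
            return False
        seen.add(tool)
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_trajectory(DEPENDS_ON, "estimate_energy", rng))
```

Under these assumptions, each sampled order is a dependency-respecting tool sequence that could seed a multi-step training trajectory. Randomizing the choice among ready tools yields several distinct but equally valid orderings for the same target.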