SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
February 13, 2026
Authors: Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
cs.AI
Abstract
Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
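The idea of treating the tool action space as a dependency graph can be illustrated with a small sketch. This is not the SciForge implementation, which the abstract does not detail. The tool names, the input/output type labels, and both helper functions below are hypothetical, chosen only to show how dependencies between tools can constrain which call sequences are logically valid.

```python
from collections import defaultdict

# Hypothetical tool specs: name -> (required input types, produced output types).
# These names are illustrative and do not come from SciAgentGym.
TOOLS = {
    "load_structure": (set(), {"structure"}),
    "compute_energy": ({"structure"}, {"energy"}),
    "optimize_geometry": ({"structure"}, {"structure_opt"}),
    "compare_energies": ({"energy", "structure_opt"}, {"report"}),
}


def build_dependency_graph(tools):
    """Add an edge producer -> consumer when a producer output feeds a consumer input."""
    graph = defaultdict(set)
    for producer, (_, outputs) in tools.items():
        for consumer, (inputs, _) in tools.items():
            if producer != consumer and outputs & inputs:
                graph[producer].add(consumer)
    return dict(graph)


def enumerate_trajectories(tools, max_len):
    """Enumerate tool sequences in which every call's inputs are already available."""
    results = []

    def extend(sequence, available):
        if sequence:
            results.append(list(sequence))
        if len(sequence) == max_len:
            return
        for name in sorted(tools):
            inputs, outputs = tools[name]
            # A tool may be called once, and only after all of its inputs exist.
            if name not in sequence and inputs <= available:
                sequence.append(name)
                extend(sequence, available | outputs)
                sequence.pop()

    extend([], set())
    return results


graph = build_dependency_graph(TOOLS)
trajectories = enumerate_trajectories(TOOLS, max_len=4)
```

Here `graph` records which tools feed which, and `enumerate_trajectories` applies the same input/output constraint directly, so every enumerated sequence respects the dependencies: in this toy setup, `compare_energies` can appear only after both `compute_energy` and `optimize_geometry`.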