Step-DeepResearch技术报告
Step-DeepResearch Technical Report
December 23, 2025
作者: Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu Liu, Jing Bai, Junlan Liu, Manjiao Liu, Na Wang, Qiuping Wu, Qinxin Du, Shiwei Li, Wen Sun, Yifeng Gong, Yonglin Chen, Yuling Zhao, Yuxuan Lin, Ziqi Ren, Zixuan Wang, Aihu Zhang, Brian Li, Buyun Ma, Kang An, Li Xie, Mingliang Li, Pan Li, Shidong Yang, Xi Chen, Xiaojia Liu, Yuchu Luo, Yuan Song, YuanHao Ding, Yuanwei Liang, Zexi Li, Zhaoning Zhang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
cs.AI
摘要
随着大语言模型向自主智能体演进,深度研究能力已成为关键评估指标。然而现有学术基准(如BrowseComp)往往难以满足开放域研究的实际需求,这类研究需要强大的意图识别、长程决策和跨源验证能力。为此,我们推出具有成本效益的端到端智能体Step-DeepResearch,提出基于原子能力的数据合成策略以强化规划与报告撰写能力,并结合从智能体中期训练到SFT与RL的渐进式训练路径。通过引入清单式评判器增强系统鲁棒性,同时针对中文领域的评估空白,建立了面向真实深度研究场景的ADR-Bench评测体系。实验表明,Step-DeepResearch(32B)在Scale AI研究量表中获得61.4%得分,在ADR-Bench上显著超越同规模模型,并与OpenAI、Gemini DeepResearch等闭源SOTA模型性能相当。这些证明通过精细化训练,中等规模模型能以业界领先的性价比实现专家级研究能力。
English
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.