Step-DeepResearch Technical Report
December 23, 2025
Authors: Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu Liu, Jing Bai, Junlan Liu, Manjiao Liu, Na Wang, Qiuping Wu, Qinxin Du, Shiwei Li, Wen Sun, Yifeng Gong, Yonglin Chen, Yuling Zhao, Yuxuan Lin, Ziqi Ren, Zixuan Wang, Aihu Zhang, Brian Li, Buyun Ma, Kang An, Li Xie, Mingliang Li, Pan Li, Shidong Yang, Xi Chen, Xiaojia Liu, Yuchu Luo, Yuan Song, YuanHao Ding, Yuanwei Liang, Zexi Li, Zhaoning Zhang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
cs.AI
Abstract
As LLMs shift toward autonomous agents, Deep Research capability has emerged as a pivotal evaluation metric. However, existing academic benchmarks such as BrowseComp often fail to meet the real-world demands of open-ended research, which requires robust intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end deep research agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training through SFT to RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on the Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms models of comparable size and rivals SOTA closed-source systems such as OpenAI and Gemini DeepResearch. These findings demonstrate that a refined training strategy enables medium-sized models to achieve expert-level deep research capabilities at industry-leading cost-efficiency.