OpenThoughts-Agent: Data Recipes for Agentic Models

June 23, 2026
Authors: Negin Raoof, Richard Zhuang, Marianna Nezhurina, Etash Guha, Atula Tejaswi, Ryan Marten, Charlie F. Ruan, Tyler Griggs, Alexander Glenn Shaw, Hritik Bansal, E. Kelly Buchanan, Artem Gazizov, Reinhard Heckel, Chinmay Hegde, Sankalp Jajee, Daanish Khazi, Emmanouil Koukoumidis, Xiangyi Li, Hange Liu, Shlok Natarajan, Harsh Raj, Nicholas Roberts, Ethan Shen, Nishad Singhi, Michael Siu, Ashima Suvarna, Hanwen Xing, Patrick Yubeaton, Robert Zhang, Leon Liangyu Chen, Xiaokun Chen, Steven Dillmann, Saadia Gabriel, Xunyi Jiang, Anurag Kashyap, Boxuan Li, Yein Park, Minh Pham, Sujay Sanghavi, Lin Shi, Ke Sun, Yixin Wang, Zhiwei Xu, Erica Zhang, Siyan Zhao, Wanjia Zhao, Jenia Jitsev, Alex Dimakis, Benjamin Feuer, Ludwig Schmidt
cs.AI

Abstract

Agentic language models dramatically expand the applications of AI, yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights into the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on it, reaching an average accuracy of 44.8% across seven agentic benchmarks, a 3.9 percentage point improvement over the strongest existing open-data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at openthoughts.ai to support future open research on agentic model training.