OpenThoughts: Data Recipes for Reasoning Models
June 4, 2025
Authors: Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt
cs.AI
Abstract
Reasoning models have made rapid progress on many benchmarks involving math,
code, and science. Yet, there are still many open questions about the best
training recipes for reasoning since state-of-the-art models often rely on
proprietary datasets with little to no public information available. To address
this, the goal of the OpenThoughts project is to create open-source datasets
for training reasoning models. After initial explorations, our OpenThoughts2-1M
dataset led to OpenThinker2-32B, the first model trained on public reasoning
data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as
AIME and LiveCodeBench. We then improve our dataset further by systematically
investigating each step of our data generation pipeline with 1,000+ controlled
experiments, leading to OpenThoughts3. Scaling the pipeline to 1.2M examples
and using QwQ-32B as the teacher yields our OpenThinker3-7B model, which achieves
state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25,
and 54% on GPQA Diamond. All of our datasets and models are available at
https://openthoughts.ai.
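To make the teacher-distillation step concrete, here is a minimal sketch of how reasoning traces could be sampled from QwQ-32B to build supervised fine-tuning pairs. This is an illustrative assumption, not the released OpenThoughts pipeline: the prompt, sampling parameters, and record schema are hypothetical, and only standard Hugging Face transformers calls are used.

```python
# Hypothetical sketch: distill reasoning traces from the QwQ-32B teacher into
# (question, response) pairs for supervised fine-tuning. Not the authors' code.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/QwQ-32B"  # teacher model named in the abstract

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"  # device_map needs `accelerate`
)

def distill_example(question: str, max_new_tokens: int = 4096) -> dict:
    """Sample one long-form reasoning trace from the teacher for `question`."""
    messages = [{"role": "user", "content": question}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.6,  # illustrative; QwQ's model card recommends ~0.6
    )
    # Keep only the newly generated tokens (the reasoning trace plus answer).
    trace = tokenizer.decode(
        output[0][input_ids.shape[-1]:], skip_special_tokens=True
    )
    return {"instruction": question, "response": trace}

# Usage: build a tiny SFT shard from a handful of questions.
questions = ["If 3x + 5 = 17, what is x? Show your reasoning."]
sft_shard = [distill_example(q) for q in questions]
```

At the paper's 1.2M-example scale, per-example generation like this would typically be replaced by a batched inference engine, with filtering and deduplication applied downstream; those steps are omitted here.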