
OpenThoughts: Data Recipes for Reasoning Models

June 4, 2025
作者: Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt
cs.AI

Abstract

Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet there are still many open questions about the best training recipes for reasoning, since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improved the dataset further by systematically investigating each step of our data generation pipeline through 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as the teacher model yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond. All of our datasets and models are available at https://openthoughts.ai.
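
The pipeline described above distills long reasoning traces from a teacher model (QwQ-32B) into supervised fine-tuning data. Below is a minimal sketch of what that generation step could look like, assuming the Hugging Face `transformers` API; the example question, sampling parameters, and JSONL output schema are illustrative assumptions, not the paper's actual pipeline configuration.

```python
# Hedged sketch: distilling reasoning traces from a teacher model into
# SFT data. Model ID matches the teacher named in the abstract; all
# other choices (prompt, sampling, schema) are illustrative assumptions.
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_id = "Qwen/QwQ-32B"  # teacher model used for OpenThoughts3
tokenizer = AutoTokenizer.from_pretrained(teacher_id)
model = AutoModelForCausalLM.from_pretrained(
    teacher_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def distill(question: str) -> dict:
    """Query the teacher for one question and keep its full reasoning trace."""
    messages = [{"role": "user", "content": question}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        input_ids, max_new_tokens=2048, do_sample=True, temperature=0.7
    )
    # Decode only the newly generated tokens (the teacher's response).
    completion = tokenizer.decode(
        output[0][input_ids.shape[1]:], skip_special_tokens=True
    )
    return {"instruction": question, "response": completion}

# Toy run; the real pipeline scales this step to 1.2M examples.
questions = ["If 3x + 5 = 20, what is x?"]
with open("distilled.jsonl", "w") as f:
    for q in questions:
        f.write(json.dumps(distill(q)) + "\n")
```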