One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL

June 3, 2025
Authors: Hyungjoo Chae, Dongjin Kang, Jihyuk Kim, Beong-woo Kwak, Sunghyun Park, Haeju Park, Jinyoung Yeo, Moontae Lee, Kyungjae Lee
cs.AI

Abstract

With the release of R1, a publicly available large reasoning model (LRM), researchers commonly build new LRMs by training language models on R1's long chain-of-thought (CoT) rationales. While prior work shows that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on existing models such as R1 remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer, and introduces controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to, or slightly below, that of R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills but also provides a strong foundation for reinforcement learning: models initialized on our data achieve 2-3x larger gains from RL with verifiable rewards (RLVR).
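To make the annotation idea concrete, below is a minimal sketch of how a short CoT LLM could be prompted to produce a longer, budget-controlled rationale. It is an illustration only, not the authors' released pipeline: the OpenAI client, the model name gpt-4o-mini, the prompt wording, and the annotate_long_cot helper are all assumptions introduced here for the example.

```python
# Illustrative sketch (not the paper's actual pipeline): elicit an o1-style,
# budget-controlled long CoT rationale from a short CoT LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical instruction encoding two ingredients from the abstract:
# (1) o1-style strategies (planning, verification, backtracking) and
# (2) an explicit thought budget to manage overthinking.
STRATEGY_PROMPT = (
    "Solve the problem with an extended chain of thought. "
    "Plan your approach first, verify each intermediate step, and "
    "backtrack explicitly when a step fails. Keep your reasoning within "
    "roughly {budget} tokens, then state the final answer."
)

def annotate_long_cot(problem: str, budget: int = 2048) -> str:
    """Collect one budget-controlled long CoT rationale for `problem`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in for any short CoT LLM
        messages=[
            {"role": "system", "content": STRATEGY_PROMPT.format(budget=budget)},
            {"role": "user", "content": problem},
        ],
        max_tokens=budget + 256,  # headroom for the final answer
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(annotate_long_cot("If 3x + 7 = 22, what is x?", budget=512))
```

Per the abstract, the budget parameter is the lever for controllability: varying it per example is one plausible way a dataset could teach a model to match rationale length to problem difficulty rather than always overthinking.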