
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL

June 3, 2025
作者: Hyungjoo Chae, Dongjin Kang, Jihyuk Kim, Beong-woo Kwak, Sunghyun Park, Haeju Park, Jinyoung Yeo, Moontae Lee, Kyungjae Lee
cs.AI

Abstract

With the release of R1, a publicly available large reasoning model (LRM), researchers commonly build new LRMs by training language models on R1's long chain-of-thought (CoT) inferences. While prior work shows that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies into short CoT LLMs, enabling them to think longer, and we introduce controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to, or slightly below, that of R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills, but also provides a strong foundation for reinforcement learning: models initialized on our data achieve 2-3x larger gains with RLVR (reinforcement learning with verifiable rewards).
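To make the annotation idea concrete, below is a minimal, hypothetical Python sketch of how one might elicit a longer, budget-controlled rationale from a short CoT LLM. The abstract does not specify how the o1-style strategies are induced or how the thought budget is enforced, so the strategy list, prompt wording, model name, and budget mechanism (a system instruction plus a hard max_tokens cap) are all assumptions for illustration, not the authors' pipeline.

# Hypothetical sketch: steer a short-CoT LLM toward longer, o1-style
# reasoning under an explicit thought budget. All prompt strings, the
# model name, and the budget mechanism are assumptions; the paper's
# actual pipeline is not described in the abstract.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumed o1-style reasoning strategies the pipeline might induce.
STRATEGIES = [
    "decompose the problem into subgoals",
    "verify each intermediate step",
    "backtrack and try an alternative when a step fails",
]

def annotate_long_cot(question: str, thought_budget: int) -> str:
    """Elicit a long CoT rationale from a short-CoT model under a token budget."""
    system = (
        "Think step by step. Apply these strategies where helpful: "
        + "; ".join(STRATEGIES)
        + f". Keep your reasoning within roughly {thought_budget} tokens, "
        "then state the final answer."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",        # stand-in for any short-CoT LLM
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        max_tokens=thought_budget,  # hard cap enforcing the budget
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(annotate_long_cot("What is 17 * 24?", thought_budget=512))

In this sketch, controllability comes from stating the budget in the prompt and enforcing it with max_tokens; varying thought_budget per example would yield rationales of different lengths, one plausible way to manage overthinking.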
