One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL

Abstract

With the release of R1, a publicly available large reasoning model (LRM), researchers commonly build new LRMs by training language models on R1's long chain-of-thought (CoT) inferences. While prior work shows that LRMs' capabilities can be reproduced through direct distillation, the continued reliance on existing models (e.g., R1) remains a critical limitation in advancing the field. As a first step toward independent LRM development, this paper explores the possibility of constructing a long CoT dataset with LLMs that are not trained for inference-time scaling. To this end, we present the Long CoT Collection, a dataset of 100K CoT rationales annotated using existing short CoT LLMs. We develop a pipeline that induces o1's novel reasoning strategies in short CoT LLMs, enabling them to think longer, and that introduces controllability over the thought budget to better manage the overthinking problem. Our extensive analyses validate that our dataset achieves quality comparable to, or slightly below, that of R1. Furthermore, our experiments demonstrate that training on our dataset not only strengthens general reasoning skills but also provides a strong foundation for reinforcement learning: models initialized on our data achieve 2-3x larger gains with RLVR.
