100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

Abstract

The recent development of reasoning language models (RLMs) marks a new stage in the evolution of large language models. In particular, the release of DeepSeek-R1 has had a broad social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, DeepSeek has not fully open-sourced the implementation details of the released models, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged that aim to reproduce the strong performance of DeepSeek-R1, reaching comparable performance through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design, and have yielded various valuable insights. In this report, we summarize recent replication studies to inspire future research. We focus primarily on SFT and RLVR as the two main directions, introducing the details of data construction, method design, and training procedure in current replication studies. Moreover, we summarize key findings from the implementation details and experimental results reported by these studies. We also discuss additional techniques for enhancing RLMs, highlight the potential for expanding the application scope of these models, and discuss the challenges in their development. Through this survey, we aim to help researchers and developers of RLMs stay up to date with the latest advances and to inspire new ideas for further enhancing RLMs.
