100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

May 1, 2025
作者: Chong Zhang, Yue Deng, Xiang Lin, Bin Wang, Dianwen Ng, Hai Ye, Xingxuan Li, Yao Xiao, Zhanfeng Mo, Qi Zhang, Lidong Bing
cs.AI

Abstract

The recent development of reasoning language models (RLMs) marks a new stage in the evolution of large language models. In particular, the release of DeepSeek-R1 has had a broad public impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, DeepSeek has not fully open-sourced the implementation details of its released models, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged that aim to reproduce the strong performance of DeepSeek-R1 through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), focusing on data preparation and method design, and have yielded a variety of valuable insights. In this report, we summarize recent replication studies to inspire future research. We focus primarily on SFT and RLVR as the two main directions, detailing the data construction, method design, and training procedures of current replication studies. Moreover, we distill key findings from the implementation details and experimental results reported in these studies. We also discuss additional techniques for enhancing RLMs, highlighting the potential of expanding the application scope of these models, and examine the challenges that remain in their development. With this survey, we aim to help researchers and developers of RLMs stay abreast of the latest advances and to inspire new ideas for further improving RLMs.
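
For readers less familiar with RLVR, its defining feature is that the reward comes from an automatic, rule-based verifier rather than a learned reward model. Below is a minimal sketch in Python, assuming a math-style task in which the model is prompted to wrap its final answer in \boxed{...}; the function names and the exact-match criterion are illustrative assumptions, not details taken from DeepSeek-R1 or any specific replication study.

    import re

    def extract_boxed_answer(response: str):
        # Pull the last \boxed{...} span out of the model's response, if any.
        matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
        return matches[-1].strip() if matches else None

    def verifiable_reward(response: str, ground_truth: str) -> float:
        # Rule-based reward: 1.0 if the extracted final answer exactly
        # matches the reference answer, 0.0 otherwise. No learned reward
        # model is involved, which is what makes the reward "verifiable".
        answer = extract_boxed_answer(response)
        return 1.0 if answer == ground_truth.strip() else 0.0

    # Usage: score a correct and an incorrect completion.
    print(verifiable_reward(r"... so the result is \boxed{42}", "42"))  # 1.0
    print(verifiable_reward(r"... I believe it is \boxed{41}", "42"))   # 0.0

In practice, replication studies typically combine such correctness rewards with format rewards (e.g., checking that the reasoning trace is enclosed in designated tags) before feeding them into an RL algorithm such as PPO or GRPO.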