

100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

May 1, 2025
作者: Chong Zhang, Yue Deng, Xiang Lin, Bin Wang, Dianwen Ng, Hai Ye, Xingxuan Li, Yao Xiao, Zhanfeng Mo, Qi Zhang, Lidong Bing
cs.AI

Abstract

The recent development of reasoning language models (RLMs) marks a novel evolution in large language models. In particular, the release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, DeepSeek has not fully open-sourced the implementation details of its released models, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged that aim to reproduce the strong performance of DeepSeek-R1, reaching comparable results through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), focusing on data preparation and method design, and have yielded various valuable insights. In this report, we summarize recent replication studies in order to inspire future research. We focus primarily on SFT and RLVR as the two main directions, describing the data construction, method design, and training procedures of current replication studies. We also distill key findings from the implementation details and experimental results reported by these studies, in the hope of informing further work. In addition, we discuss other techniques for enhancing RLMs, highlighting the potential of expanding the application scope of these models and examining the challenges that remain in their development. Through this survey, we aim to help researchers and developers of RLMs stay up to date with the latest advances and to spark new ideas for further improving these models.