
Skywork Open Reasoner 1 Technical Report

May 28, 2025
Authors: Jujie He, Jiacai Liu, Chris Yuhao Liu, Rui Yan, Chaojie Wang, Peng Cheng, Xiaoyu Zhang, Fuxiang Zhang, Jiacheng Xu, Wei Shen, Siyuan Li, Liang Zeng, Tianwen Wei, Cheng Cheng, Bo An, Yang Liu, Yahui Zhou
cs.AI

Abstract

The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.
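The abstract notes that mitigating premature entropy collapse is critical for test performance. As an illustrative sketch (not the paper's implementation), policy entropy during RL training can be monitored by averaging the per-token entropy of the model's output distribution; a value falling toward zero signals that the policy is becoming near-deterministic:

```python
import numpy as np

def mean_token_entropy(logits):
    """Mean per-token policy entropy for a (seq_len, vocab_size) logit array.

    A steady drop toward zero over training steps is the usual symptom
    of entropy collapse: the policy stops exploring alternative tokens.
    """
    # Numerically stable softmax per token position.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # H = -sum_v p(v) log p(v), averaged over the sequence.
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return float(ent.mean())

# Near-uniform logits: entropy close to log(vocab_size) = log(100) ~ 4.6.
rng = np.random.default_rng(0)
healthy = rng.normal(scale=0.01, size=(8, 100))
# Sharply peaked logits: entropy near zero (collapsed policy).
collapsed = np.zeros((8, 100))
collapsed[:, 0] = 20.0

print(mean_token_entropy(healthy))
print(mean_token_entropy(collapsed))
```

Tracking this statistic per training step (as many RL frameworks do) makes it easy to see when entropy decays too quickly relative to reward improvement.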
