Skywork-R1V3 Technical Report

Abstract

We introduce Skywork-R1V3, an advanced open-source vision-language model (VLM) that takes a new approach to visual reasoning. Its key innovation is the effective transfer of reasoning skills from text-only large language models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 stems primarily from our carefully designed post-training reinforcement learning (RL) framework, which activates and enhances the model's reasoning ability without additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a distinctive indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on the MMMU benchmark, improving from 64.3% to 76.0%, a level that matches entry-level human capability. Remarkably, our RL-powered post-training approach enables even a 38B-parameter model to rival top closed-source VLMs. The implementation also successfully transfers mathematical reasoning to reasoning tasks in other subjects. We further analyze curriculum learning and reinforcement fine-tuning strategies and provide a broader discussion of multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning and showcases RL as a powerful engine for advancing open-source VLM capabilities.
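
To make the entropy-based indicator mentioned above more concrete, the following is a minimal sketch of one plausible way to score a checkpoint, assuming the indicator is the Shannon entropy of the model's next-token distribution averaged over positions flagged as critical reasoning tokens. The function names, the `critical_positions` argument, and the averaging step are illustrative assumptions, not the report's implementation. The abstract does not specify how critical tokens are identified or how the score is turned into a selection rule, so the sketch only computes the per-checkpoint score.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token probability distribution."""
    # Zero-probability entries contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_critical_entropy(distributions, critical_positions):
    """Average entropy over positions flagged as critical reasoning tokens.

    distributions: list of per-position probability lists, each summing to 1.
    critical_positions: indices into `distributions`. This argument is
        hypothetical; how critical tokens are chosen is an assumption here.
    """
    if not critical_positions:
        raise ValueError("no critical positions given")
    total = sum(token_entropy(distributions[i]) for i in critical_positions)
    return total / len(critical_positions)


# Toy example: three positions, of which the last two are treated as critical.
toy_distributions = [
    [1.0, 0.0],                # fully confident prediction, entropy 0
    [0.5, 0.5],                # two equally likely tokens, entropy ln 2
    [0.25, 0.25, 0.25, 0.25],  # four equally likely tokens, entropy ln 4
]
toy_score = mean_critical_entropy(toy_distributions, [1, 2])
```

In practice, the per-position distributions would come from a softmax over the model's logits on a held-out set of prompts, and the resulting score would be tracked across RL checkpoints alongside benchmark accuracy.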