Skywork-R1V3 Technical Report

Abstract

We introduce Skywork-R1V3, an advanced open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only large language models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 stems primarily from our carefully designed post-training reinforcement learning (RL) framework, which activates and enhances the model's reasoning ability without additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, improving from 64.3% to 76.0%, a level that matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even a 38B-parameter model to rival top closed-source VLMs. The approach also successfully transfers mathematical reasoning to reasoning tasks in other subjects. We further analyze curriculum learning and reinforcement fine-tuning strategies, and include a broader discussion of multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.