Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
April 23, 2025
Authors: Chris, Yichen Wei, Yi Peng, Xiaokun Wang, Weijie Qiu, Wei Shen, Tianyidan Xie, Jiangbo Pei, Jianhao Zhang, Yunzhuo Hao, Xuchen Song, Yang Liu, Yahui Zhou
cs.AI
Abstract
We present Skywork R1V2, a next-generation multimodal reasoning model and a
major leap forward from its predecessor, Skywork R1V. At its core, R1V2
introduces a hybrid reinforcement learning paradigm that harmonizes
reward-model guidance with rule-based strategies, thereby addressing the
long-standing challenge of balancing sophisticated reasoning capabilities with
broad generalization. To further enhance training efficiency, we propose the
Selective Sample Buffer (SSB) mechanism, which effectively counters the
"Vanishing Advantages" dilemma inherent in Group Relative Policy Optimization
(GRPO) by prioritizing high-value samples throughout the optimization process.
Notably, we observe that excessive reinforcement signals can induce visual
hallucinations, a phenomenon we systematically monitor and mitigate through
calibrated reward thresholds throughout the training process. Empirical results
affirm the exceptional capability of R1V2, with benchmark-leading performances
such as 62.6 on OlympiadBench, 79.0 on AIME2024, 63.6 on LiveCodeBench, and
74.0 on MMMU. These results underscore R1V2's superiority over existing
open-source models and demonstrate significant progress in closing the
performance gap with premier proprietary systems, including Gemini 2.5 and
OpenAI o4-mini. The Skywork R1V2 model weights have been publicly released to
promote openness and reproducibility at
https://huggingface.co/Skywork/Skywork-R1V2-38B.
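
The "Vanishing Advantages" problem mentioned in the abstract can be illustrated with a small sketch. In GRPO-style training, each response's advantage is its reward normalized against the other responses sampled for the same prompt, so a group whose rewards are all identical yields zero advantage for every member and contributes no learning signal. The abstract does not describe how the Selective Sample Buffer is implemented; the class below, including its name, the `capacity` and `min_abs_advantage` parameters, and the weighting by absolute advantage, is a hypothetical illustration of "prioritizing high-value samples", not the authors' method.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std (GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class SelectiveSampleBuffer:
    """Hypothetical buffer (an assumption, not the paper's design):
    keeps only samples whose group-relative advantage is non-negligible."""

    def __init__(self, capacity=1024, min_abs_advantage=1e-3):
        self.capacity = capacity
        self.min_abs_advantage = min_abs_advantage
        self.items = []  # list of (sample, advantage) pairs

    def add_group(self, samples, rewards):
        """Score one group of responses and store the informative ones.

        Returns the number of samples kept from this group.
        """
        advantages = group_relative_advantages(rewards)
        kept = 0
        for sample, adv in zip(samples, advantages):
            if abs(adv) >= self.min_abs_advantage:
                self.items.append((sample, adv))
                kept += 1
        # Drop the oldest entries once the buffer exceeds its capacity.
        self.items = self.items[-self.capacity:]
        return kept

    def sample(self, k, rng=random):
        """Draw k items with probability proportional to |advantage|."""
        if not self.items:
            return []
        weights = [abs(adv) for _, adv in self.items]
        return rng.choices(self.items, weights=weights, k=k)
```

With this sketch, a group scored `[1.0, 1.0, 1.0, 1.0]` adds nothing to the buffer, while a group scored `[1.0, 0.0, 0.0, 1.0]` adds all four responses, which can later be redrawn to fill batches that would otherwise carry no gradient signal.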