OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

Abstract

Group Relative Policy Optimization (GRPO) has emerged as the de facto Reinforcement Learning (RL) objective driving recent advancements in Multimodal Large Language Models. However, extending this success to open-source multimodal generalist models remains heavily constrained by two primary challenges: the extreme variance in reward topologies across diverse visual tasks, and the inherent difficulty of balancing fine-grained perception with multi-step reasoning capabilities. To address these issues, we introduce Gaussian GRPO (G^2RPO), a novel RL training objective that replaces standard linear scaling with non-linear distributional matching. By mathematically forcing the advantage distribution of any given task to strictly converge to a standard normal distribution, N(0,1), G^2RPO theoretically ensures inter-task gradient equity, mitigates vulnerabilities to heavy-tail outliers, and offers symmetric updates for positive and negative rewards. Leveraging the enhanced training stability provided by G^2RPO, we introduce two task-level shaping mechanisms to balance perception and reasoning. First, response length shaping dynamically elicits extended reasoning chains for complex queries while enforcing direct outputs for simple queries to bolster visual grounding. Second, entropy shaping tightly bounds the model's exploration zone, effectively preventing both entropy collapse and entropy explosion. Integrating these methodologies, we present OpenVLThinkerV2, a highly robust, general-purpose multimodal model. Extensive evaluations across 18 diverse benchmarks demonstrate its superior performance over strong open-source and leading proprietary frontier models.
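The contrast between linear scaling and non-linear distributional matching can be illustrated with a small sketch. The abstract does not specify how G^2RPO performs the matching, so the code below is only an assumption: it uses a rank-based inverse normal transform as one plausible way to map a group of rewards onto N(0,1) quantiles. The function names `grpo_advantages` and `gaussianized_advantages` are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist, mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Linear scaling: z-score each reward within its group (standard GRPO style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gaussianized_advantages(rewards):
    """Assumed non-linear alternative: rank-based inverse normal transform.

    Each reward is replaced by the standard normal quantile of its rank,
    so the output depends only on the ordering of rewards, not their scale.
    This is an illustrative stand-in, not the paper's confirmed formulation.
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied rewards starting at sorted position i.
        j = i
        while j + 1 < n and rewards[order[j + 1]] == rewards[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by the ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (rank - 0.5) / n lies strictly inside (0, 1), so inv_cdf is always defined.
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


if __name__ == "__main__":
    heavy_tail = [0.1, 0.2, 0.3, 100.0]  # one outlier reward
    print("linear scaling:", grpo_advantages(heavy_tail))
    print("gaussianized:  ", gaussianized_advantages(heavy_tail))
```

In this sketch the outlier reward of 100.0 receives the same advantage it would get if it were 4.0, because only ranks matter, and the advantages for the best and worst responses are equal in magnitude with opposite signs. That mirrors the robustness and symmetry properties the abstract claims, under the stated assumption about the transform.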