OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
March 20, 2025
Authors: Zhiyuan Liu, Yuting Zhang, Feng Liu, Changwang Zhang, Ying Sun, Jun Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have gained significant traction for
their ability to process diverse input data types and generate coherent,
contextually relevant outputs across various applications. While supervised
fine-tuning (SFT) has been the predominant approach to task-specific
optimization of MLLMs, it often falls short in fostering
crucial generalized reasoning abilities. Although reinforcement learning (RL)
holds great promise in overcoming these limitations, it encounters two
significant challenges: (1) its capacity to generalize across multimodal tasks
remains largely unexplored, and (2) its training constraints, including the
constant Kullback-Leibler divergence penalty and the clamp strategy, often
create bottlenecks that lead to suboptimal performance. To address these
challenges, we propose OThink-MR1, an
advanced MLLM equipped with profound comprehension and reasoning capabilities
across multimodal tasks. Specifically, we introduce Group Relative Policy
Optimization with a dynamic Kullback-Leibler strategy (GRPO-D), which markedly
enhances RL performance. For Qwen2-VL-2B-Instruct,
GRPO-D achieves a relative improvement of more than 5.72% over SFT and more
than 13.59% over GRPO in same-task evaluation on two adapted datasets.
Furthermore, GRPO-D demonstrates remarkable cross-task generalization
capabilities, with an average relative improvement of more than 61.63% over SFT
in cross-task evaluation. These results highlight that the MLLM trained with
GRPO-D on one multimodal task can be effectively transferred to another task,
underscoring the superior generalized reasoning capabilities of our proposed
OThink-MR1 model.
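To make the core idea concrete, the sketch below shows a GRPO-style group-relative objective in which the constant KL coefficient criticized in the abstract is replaced by a scheduled, dynamic one. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear warm-up schedule, the default coefficients, and the helper names (dynamic_kl_weight, grpo_d_loss) are all hypothetical.

```python
# Minimal sketch of a GRPO-style loss with a dynamic KL weight (the
# idea behind GRPO-D). The schedule and hyperparameters below are
# illustrative assumptions; the paper's exact schedule may differ.
import torch

def dynamic_kl_weight(step, total_steps, beta_min=0.0, beta_max=0.04):
    """Assumed linear warm-up: a small KL penalty early in training
    (more exploration), annealed toward a stronger penalty later.
    This replaces the constant beta used in standard GRPO."""
    frac = min(step / max(total_steps, 1), 1.0)
    return beta_min + frac * (beta_max - beta_min)

def grpo_d_loss(logp_new, logp_old, logp_ref, rewards,
                step, total_steps, clip_eps=0.2):
    """logp_*: (G,) summed log-probs of G sampled responses to one prompt
    under the current, behavior, and frozen reference policies.
    rewards: (G,) scalar rewards for the group."""
    # Group-relative advantages: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # PPO-style clipped surrogate on the importance ratio.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(
        ratio * adv,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
    )
    # Unbiased k3 KL estimator toward the reference policy.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0
    beta = dynamic_kl_weight(step, total_steps)  # dynamic, not constant
    return -(surrogate - beta * kl).mean()
```

In this sketch the KL weight starts near zero to let the policy explore away from the reference model and grows over training to stabilize it; setting beta_min equal to beta_max recovers a constant-beta GRPO objective.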