OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning

Abstract

Multimodal Large Language Models (MLLMs) have gained significant traction for their ability to process diverse input data types and generate coherent, contextually relevant outputs across various applications. Supervised fine-tuning (SFT) has been the predominant approach for enhancing MLLM capabilities through task-specific optimization, but it often falls short in fostering crucial generalized reasoning abilities. Reinforcement learning (RL) holds great promise for overcoming these limitations, yet it faces two significant challenges: (1) its generalization capacity in multimodal tasks remains largely unexplored, and (2) its training constraints, including a constant Kullback-Leibler divergence term or the clamp strategy, often result in suboptimal bottlenecks. To address these challenges, we propose OThink-MR1, an advanced MLLM equipped with profound comprehension and reasoning capabilities across multimodal tasks. Specifically, we introduce Group Relative Policy Optimization with a dynamic Kullback-Leibler strategy (GRPO-D), which markedly enhances RL performance. For Qwen2-VL-2B-Instruct, GRPO-D achieves a relative improvement of more than 5.72% over SFT and more than 13.59% over GRPO in same-task evaluation on two adapted datasets. Furthermore, GRPO-D demonstrates remarkable cross-task generalization, with an average relative improvement of more than 61.63% over SFT in cross-task evaluation. These results show that an MLLM trained with GRPO-D on one multimodal task can be effectively transferred to another task, underscoring the superior generalized reasoning capabilities of the proposed OThink-MR1 model.
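The abstract names the ingredients of GRPO-D (group-relative policy optimization plus a Kullback-Leibler term whose weight changes during training) but does not spell out the schedule. The following is a minimal sketch under stated assumptions, not the authors' implementation: the linear interpolation in `kl_weight`, the function names, and all numeric values are hypothetical and chosen only for illustration, and the clipping used in practical policy-gradient objectives is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_weight(step, total_steps, beta_start, beta_end):
    """Hypothetical schedule: linearly interpolate the KL coefficient over training.

    The actual schedule is not given in the abstract; this is an assumption.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return beta_start + frac * (beta_end - beta_start)


def penalized_objective(ratios, advantages, kl_estimates, beta):
    """Group mean of ratio * advantage - beta * KL (clipping omitted for brevity)."""
    terms = [r * a - beta * k for r, a, k in zip(ratios, advantages, kl_estimates)]
    return mean(terms)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]  # made-up rewards for four sampled responses
    advantages = group_relative_advantages(rewards)
    beta = kl_weight(step=50, total_steps=100, beta_start=0.0, beta_end=0.04)
    ratios = [1.0] * len(rewards)       # on-policy: probability ratios equal 1
    kls = [0.01] * len(rewards)         # made-up per-response KL estimates
    print(advantages, beta, penalized_objective(ratios, advantages, kls, beta))
```

With a constant `beta` (that is, `beta_start == beta_end`) the sketch reduces to a fixed KL penalty, which is the kind of constraint the abstract describes as a bottleneck; letting `beta` depend on `step` is the only change the dynamic variant makes here.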
