OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning

Abstract

Multimodal Large Language Models (MLLMs) have gained significant traction for their ability to process diverse input data types and generate coherent, contextually relevant outputs across various applications. While supervised fine-tuning (SFT) has been the predominant approach for enhancing MLLM capabilities through task-specific optimization, it often falls short in fostering crucial generalized reasoning abilities. Reinforcement learning (RL) holds great promise for overcoming these limitations, but it faces two significant challenges: (1) its generalization capacity in multimodal tasks remains largely unexplored, and (2) its training constraints, including the constant Kullback-Leibler divergence or the clamp strategy, often result in suboptimal bottlenecks. To address these challenges, we propose OThink-MR1, an advanced MLLM equipped with profound comprehension and reasoning capabilities across multimodal tasks. Specifically, we introduce Group Relative Policy Optimization with a dynamic Kullback-Leibler strategy (GRPO-D), which markedly enhances RL performance. For Qwen2-VL-2B-Instruct, GRPO-D achieves a relative improvement of more than 5.72% over SFT and more than 13.59% over GRPO in same-task evaluation on two adapted datasets. Furthermore, GRPO-D demonstrates remarkable cross-task generalization, with an average relative improvement of more than 61.63% over SFT in cross-task evaluation. These results show that an MLLM trained with GRPO-D on one multimodal task can be effectively transferred to another task, underscoring the superior generalized reasoning capabilities of the proposed OThink-MR1 model.
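The abstract names GRPO-D but does not spell out the dynamic Kullback-Leibler strategy. The following is a minimal sketch of the general idea only, not the authors' implementation: rewards are normalized within a sampled group (the group-relative part of GRPO), and the weight on the KL penalty changes with the training step instead of staying constant. The linear schedule, its direction, the values `beta_start` and `beta_end`, and all function names are assumptions made for illustration. The probability ratio and clamp used in full GRPO objectives are omitted for brevity.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_weight(step, total_steps, beta_start=0.0, beta_end=0.04):
    """Hypothetical linear schedule for the KL penalty weight.

    Assumption: the abstract does not give the schedule, so a simple
    linear ramp from beta_start to beta_end stands in for it here.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return beta_start + frac * (beta_end - beta_start)


def penalized_objective(advantages, kl_values, beta):
    """Group mean of (advantage - beta * KL); a quantity to maximize."""
    terms = [a - beta * k for a, k in zip(advantages, kl_values)]
    return mean(terms)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]   # hypothetical rewards for 4 sampled answers
    kl_values = [0.5, 0.5, 0.5, 0.5]  # hypothetical per-sample KL estimates
    adv = group_relative_advantages(rewards)
    for step in (0, 50, 100):
        beta = kl_weight(step, total_steps=100)
        print(step, beta, penalized_objective(adv, kl_values, beta))
```

With a constant weight, the penalty term pulls the policy toward the reference model equally throughout training; a step-dependent weight lets that pull vary over the run, which is the kind of constraint the abstract identifies as a bottleneck when held fixed.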
