MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
May 30, 2025
Authors: Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as
a powerful paradigm for post-training large language models (LLMs), achieving
state-of-the-art performance on tasks with structured, verifiable answers.
Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but
is complicated by the broader, heterogeneous nature of vision-language tasks
that demand nuanced visual, logical, and spatial capabilities. As such,
training MLLMs with RLVR on multiple datasets could be beneficial, but
interactions among diverse datasets can produce conflicting objectives,
highlighting the need for optimal dataset mixture strategies to improve
generalization and reasoning. We introduce a systematic post-training framework
for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation
and benchmark implementation. Specifically, (1) we develop a multimodal RLVR
framework for multi-dataset post-training by curating a dataset of diverse
verifiable vision-language problems and enabling multi-domain online RL with
different verifiable rewards; (2) we propose a data mixture strategy that
learns to predict the RL fine-tuning outcome from the data mixture
distribution and uses this prediction to optimize toward the best mixture.
Comprehensive experiments show that multi-domain RLVR training, when combined
with mixture prediction strategies, can significantly boost MLLMs' general
reasoning capabilities. Our best mixture improves the post-trained model's
accuracy on out-of-distribution benchmarks by an average of 5.24% compared to
the same model post-trained with a uniform data mixture, and by a total of
20.74% compared to the pre-finetuning baseline.
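
To make the framework in point (1) concrete, the sketch below shows one way per-domain verifiable rewards could be wired together: each training sample carries a domain tag, and the online RL loop scores rollouts with the matching verifier. This is a minimal illustration, not the paper's implementation; the function names, domain tags, and reward rules here are all assumptions.

```python
# Minimal sketch (not the paper's code) of per-domain verifiable rewards
# for multi-domain RLVR. Names and reward rules are illustrative.
import re

def math_reward(response: str, gold: str) -> float:
    """1.0 if the last number in the response matches the gold answer."""
    nums = re.findall(r"-?\d+\.?\d*", response)
    return 1.0 if nums and nums[-1] == gold else 0.0

def iou_reward(pred_box, gold_box) -> float:
    """Intersection-over-union for a visual-grounding box [x1, y1, x2, y2]."""
    ix1, iy1 = max(pred_box[0], gold_box[0]), max(pred_box[1], gold_box[1])
    ix2, iy2 = min(pred_box[2], gold_box[2]), min(pred_box[3], gold_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gold_box) - inter
    return inter / union if union > 0 else 0.0

# Each sample is tagged with its domain, so the online RL loop can
# dispatch rollouts to the matching verifier.
VERIFIERS = {"math": math_reward, "grounding": iou_reward}

def verifiable_reward(domain: str, response, gold) -> float:
    return VERIFIERS[domain](response, gold)
```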
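For point (2), here is a hedged sketch of the mixture-prediction idea: fit a surrogate model on a handful of pilot RLVR runs, mapping a mixture weight vector (a point on the probability simplex) to the benchmark accuracy it produced, then pick the mixture the surrogate scores highest. The quadratic surrogate, the Dirichlet candidate search, and every name below are illustrative assumptions; the paper's actual predictor may differ.

```python
# Hedged sketch of mixture prediction: a surrogate from mixture weights to
# post-RLVR accuracy, then a search over the simplex for the best mixture.
import numpy as np

def quadratic_features(w: np.ndarray) -> np.ndarray:
    """Raw weights plus pairwise products, so the surrogate can capture
    cross-dataset interactions."""
    pairs = [w[i] * w[j] for i in range(len(w)) for j in range(i, len(w))]
    return np.concatenate([w, pairs])

def fit_surrogate(mixtures: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Least-squares fit over pilot runs: each row of `mixtures` is a simplex
    weight vector; `scores` holds the resulting benchmark accuracies."""
    X = np.stack([quadratic_features(w) for w in mixtures])
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return coef

def best_mixture(coef: np.ndarray, n_datasets: int,
                 n_samples: int = 100_000, seed: int = 0) -> np.ndarray:
    """Score random Dirichlet samples of the simplex with the surrogate and
    return the highest-predicted mixture to seed the final full RLVR run."""
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(n_datasets), size=n_samples)
    preds = np.stack([quadratic_features(w) for w in candidates]) @ coef
    return candidates[int(np.argmax(preds))]
```

The pairwise-product features are one simple way to let the surrogate express the cross-dataset interactions, and hence the conflicting objectives, that the abstract highlights as the core difficulty of multi-dataset RLVR.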