MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
May 30, 2025
Authors: Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as
a powerful paradigm for post-training large language models (LLMs), achieving
state-of-the-art performance on tasks with structured, verifiable answers.
Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but
is complicated by the broader, heterogeneous nature of vision-language tasks
that demand nuanced visual, logical, and spatial capabilities. As such,
training MLLMs with RLVR on multiple datasets could be beneficial, but
interactions among diverse datasets can produce conflicting objectives,
highlighting the need for optimal dataset mixture strategies to improve
generalization and reasoning. We introduce a systematic post-training framework
for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation
and benchmark implementation. Specifically, (1) we develop a multimodal RLVR
framework for multi-dataset post-training by curating a dataset of diverse
verifiable vision-language problems and enabling multi-domain online RL with
different verifiable rewards; (2) we propose a data mixture strategy that
learns to predict the RL fine-tuning outcome from the data mixture
distribution and uses this prediction to optimize toward the best mixture.
Comprehensive experiments show that multi-domain RLVR training, when combined
with mixture prediction strategies, can significantly boost MLLMs' general
reasoning capabilities. Our best mixture improves the post-trained model's
accuracy on out-of-distribution benchmarks by an average of 5.24% compared to
the same model post-trained with a uniform data mixture, and by a total of
20.74% compared to the pre-finetuning baseline.
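
To make the framework in point (1) concrete, the sketch below shows one way per-domain verifiable rewards could be wired together: each training sample carries a domain tag, and the online RL loop scores rollouts with the matching verifier. This is a minimal illustration, not the paper's implementation; the function names, domain tags, and reward rules here are all assumptions.

```python
# Minimal sketch (not the paper's code) of per-domain verifiable rewards
# for multi-domain RLVR. Names and reward rules are illustrative.
import re

def math_reward(response: str, gold: str) -> float:
    """1.0 if the last number in the response matches the gold answer."""
    nums = re.findall(r"-?\d+\.?\d*", response)
    return 1.0 if nums and nums[-1] == gold else 0.0

def iou_reward(pred_box, gold_box) -> float:
    """Intersection-over-union for a visual-grounding box [x1, y1, x2, y2]."""
    ix1, iy1 = max(pred_box[0], gold_box[0]), max(pred_box[1], gold_box[1])
    ix2, iy2 = min(pred_box[2], gold_box[2]), min(pred_box[3], gold_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(pred_box) + area(gold_box) - inter
    return inter / union if union > 0 else 0.0

# Each sample is tagged with its domain, so the online RL loop can
# dispatch rollouts to the matching verifier.
VERIFIERS = {"math": math_reward, "grounding": iou_reward}

def verifiable_reward(domain: str, response, gold) -> float:
    return VERIFIERS[domain](response, gold)
```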
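For point (2), here is a hedged sketch of the mixture-prediction idea: fit a surrogate model on a handful of pilot RLVR runs, mapping a mixture weight vector (a point on the probability simplex) to the benchmark accuracy it produced, then pick the mixture the surrogate scores highest. The quadratic surrogate, the Dirichlet candidate search, and every name below are illustrative assumptions; the paper's actual predictor may differ.

```python
# Hedged sketch of mixture prediction: a surrogate from mixture weights to
# post-RLVR accuracy, then a search over the simplex for the best mixture.
import numpy as np

def quadratic_features(w: np.ndarray) -> np.ndarray:
    """Raw weights plus pairwise products, so the surrogate can capture
    cross-dataset interactions."""
    pairs = [w[i] * w[j] for i in range(len(w)) for j in range(i, len(w))]
    return np.concatenate([w, pairs])

def fit_surrogate(mixtures: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Least-squares fit over pilot runs: each row of `mixtures` is a simplex
    weight vector; `scores` holds the resulting benchmark accuracies."""
    X = np.stack([quadratic_features(w) for w in mixtures])
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return coef

def best_mixture(coef: np.ndarray, n_datasets: int,
                 n_samples: int = 100_000, seed: int = 0) -> np.ndarray:
    """Score random Dirichlet samples of the simplex with the surrogate and
    return the highest-predicted mixture to seed the final full RLVR run."""
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(n_datasets), size=n_samples)
    preds = np.stack([quadratic_features(w) for w in candidates]) @ coef
    return candidates[int(np.argmax(preds))]
```

The pairwise-product features are one simple way to let the surrogate express the cross-dataset interactions, and hence the conflicting objectives, that the abstract highlights as the core difficulty of multi-dataset RLVR.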