
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning

May 30, 2025
Authors: Yiqing Liang, Jielin Qiu, Wenhao Ding, Zuxin Liu, James Tompkin, Mengdi Xu, Mengzhou Xia, Zhengzhong Tu, Laixi Shi, Jiacheng Zhu
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for post-training large language models (LLMs), achieving state-of-the-art performance on tasks with structured, verifiable answers. Applying RLVR to Multimodal LLMs (MLLMs) presents significant opportunities but is complicated by the broader, heterogeneous nature of vision-language tasks that demand nuanced visual, logical, and spatial capabilities. As such, training MLLMs using RLVR on multiple datasets could be beneficial but creates challenges with conflicting objectives from interaction among diverse datasets, highlighting the need for optimal dataset mixture strategies to improve generalization and reasoning. We introduce a systematic post-training framework for Multimodal LLM RLVR, featuring a rigorous data mixture problem formulation and benchmark implementation. Specifically, (1) We developed a multimodal RLVR framework for multi-dataset post-training by curating a dataset that contains different verifiable vision-language problems and enabling multi-domain online RL learning with different verifiable rewards; (2) We proposed a data mixture strategy that learns to predict the RL fine-tuning outcome from the data mixture distribution, and consequently optimizes the best mixture. Comprehensive experiments showcase that multi-domain RLVR training, when combined with mixture prediction strategies, can significantly boost MLLM general reasoning capacities. Our best mixture improves the post-trained model's accuracy on out-of-distribution benchmarks by an average of 5.24% compared to the same model post-trained with uniform data mixture, and by a total of 20.74% compared to the pre-finetuning baseline.
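The mixture-prediction idea in point (2) can be illustrated with a minimal sketch: fit a surrogate model that maps a data-mixture distribution (a point on the probability simplex over domains) to a predicted post-RLVR benchmark score, then optimize the mixture against that surrogate. Everything below is an illustrative assumption, not the paper's actual model: the quadratic feature map, the synthetic trial data, and the domain names are all hypothetical.

```python
# Hypothetical sketch of mixture prediction + optimization.
# Assumptions (not from the paper): a quadratic surrogate fit by least
# squares, synthetic trial mixtures/scores, and 4 illustrative domains.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_domains = 4  # e.g. counting, geometry, charts, OCR (illustrative)

# Pretend we ran RLVR with a handful of trial mixtures and measured
# out-of-distribution accuracy for each (synthetic data here).
trial_mixtures = rng.dirichlet(np.ones(n_domains), size=12)
trial_scores = (0.5 + trial_mixtures @ np.array([0.1, 0.05, 0.2, 0.0])
                - 0.3 * ((trial_mixtures - 0.25) ** 2).sum(axis=1))

def features(w):
    """Quadratic features of a mixture: [w_i] plus all w_i * w_j terms."""
    w = np.asarray(w)
    quad = np.outer(w, w)[np.triu_indices(len(w))]
    return np.concatenate([w, quad])

# Least-squares fit of the surrogate score predictor.
X = np.stack([features(w) for w in trial_mixtures])
coef, *_ = np.linalg.lstsq(X, trial_scores, rcond=None)

def predicted_score(w):
    return features(w) @ coef

# Maximize the predicted score over the probability simplex.
res = minimize(
    lambda w: -predicted_score(w),
    x0=np.full(n_domains, 1.0 / n_domains),  # start from uniform mixture
    bounds=[(0.0, 1.0)] * n_domains,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
)
best_mixture = res.x  # candidate mixture to run the real RLVR training on
```

In practice the surrogate would be refit as new (mixture, score) pairs come in, since each evaluation of a mixture requires a full RL fine-tuning run.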
