UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
May 29, 2025
Authors: Weijia Mao, Zhenheng Yang, Mike Zheng Shou
cs.AI
Abstract
Unified multimodal large language models such as Show-o and Janus have
achieved strong performance across both generation and understanding tasks.
However, these models typically rely on large-scale datasets and require
substantial computation during the pretraining stage. In addition, several
post-training methods have been proposed, but they often depend on external
data or are limited to task-specific customization. In this work, we introduce
UniRL, a self-improving post-training approach. Our approach enables the model
to generate images from prompts and use them as training data in each
iteration, without relying on any external image data. Moreover, it enables the
two tasks to enhance each other: the generated images are used for
understanding, and the understanding results are used to supervise generation.
We explore supervised fine-tuning (SFT) and Group Relative Policy Optimization
(GRPO) to optimize the models. UniRL offers three key advantages: (1) it
requires no external image data, as all training samples are generated by the
model itself during training; (2) it not only improves individual task
performance, but also reduces the imbalance between generation and
understanding; and (3) it requires only a few additional training steps
during the post-training stage. We evaluate UniRL on top of Show-o and Janus,
achieving a GenEval score of 0.77 for Show-o and 0.65 for Janus. Code and
models will be released at https://github.com/showlab/UniRL.
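
To make the GRPO step mentioned above concrete, the following is a minimal sketch of the group-relative advantage computation that gives GRPO its name: several samples are drawn for the same prompt, each receives a scalar reward, and the rewards are normalized within that group. The function name, the `eps` term, the use of the population standard deviation, and the example reward values are assumptions made for illustration; the abstract does not specify UniRL's reward design or implementation details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is an assumed small constant that avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images generated from one prompt,
# e.g. 1.0 if a check on the generated image passes and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged, so no separate value network is needed to provide a baseline.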