
MMaDA: Multimodal Large Diffusion Language Models

May 21, 2025
Authors: Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
cs.AI

Abstract

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
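The abstract does not spell out UniGRPO's exact formulation; it only describes it as a policy-gradient RL algorithm for diffusion foundation models that uses diversified reward modeling. As a loose, illustrative sketch only, the snippet below shows a generic GRPO-style group-relative advantage computation and a toy combination of two task-specific rewards. Every function name, tensor shape, and weighting choice here is an assumption made for illustration, not a detail taken from the paper.

```python
# Illustrative sketch only: a GRPO-style group-relative advantage computation,
# which a policy-gradient method such as UniGRPO could plausibly build on.
# All names, shapes, and hyperparameters below are assumptions, not details
# from the MMaDA paper.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize per-sample rewards within each group of rollouts.

    rewards: tensor of shape (num_prompts, group_size), where each row holds
    the scalar rewards of several candidate completions for the same prompt.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


def mixed_reward(text_score: float, image_score: float, w_text: float = 0.5) -> float:
    """Toy combination of two task-specific rewards (e.g., a reasoning-correctness
    score and an image-quality score) into one scalar, a very coarse stand-in for
    the idea of 'diversified reward modeling'."""
    return w_text * text_score + (1.0 - w_text) * image_score


if __name__ == "__main__":
    # Example: 2 prompts, 4 sampled completions each.
    rewards = torch.tensor([[0.1, 0.9, 0.4, 0.6],
                            [1.0, 0.2, 0.5, 0.3]])
    print(group_relative_advantages(rewards))
```

The normalized advantages would then weight a policy-gradient update on the model's sampled outputs; how UniGRPO adapts this to the denoising steps of a diffusion model is specified in the paper itself, not here.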
