ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
March 9, 2025
Authors: Jianwen Sun, Yukang Feng, Chuanhao Li, Fanrui Zhang, Zizhen Li, Jiaxin Ai, Sizhuo Zhou, Yu Dai, Shenglin Zhang, Kaipeng Zhang
cs.AI
Abstract
Unified models (UniMs) for multimodal understanding and generation have
recently received much attention in the area of vision and language. Existing
UniMs are designed to simultaneously learn both multimodal understanding and
generation capabilities, demanding substantial computational resources, and
they often struggle to generate interleaved text-image content. We present ARMOR, a
resource-efficient and pure autoregressive framework that achieves both
understanding and generation by fine-tuning existing multimodal large language
models (MLLMs). Specifically, ARMOR extends existing MLLMs from three
perspectives: (1) For model architecture, an asymmetric encoder-decoder
architecture with a forward-switching mechanism is introduced to unify the
embedding spaces of the textual and visual modalities, enabling natural
interleaved text-image generation with minimal computational overhead. (2) For
training data, a meticulously curated, high-quality interleaved dataset is
collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a
"what or how to generate" algorithm to empower existing MLLMs with multimodal
generation capabilities while preserving their multimodal understanding
capabilities, through three progressive training stages based on the collected
dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to
UniMs with promising image generation capabilities, using limited training
resources. Our code will be released soon at https://armor.github.io.
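The forward-switching mechanism in (1) can be pictured as a two-head autoregressive loop: the model emits text tokens until it predicts an image-start token, switches to a visual decoding head until an image-end token, then switches back. The sketch below is a hypothetical illustration only; the token names (`<img>`, `</img>`), the two stub "heads", and the control flow are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a forward-switching interleaved generation loop.
# The special tokens and decoder heads here are illustrative stand-ins.
IMG_START, IMG_END, EOS = "<img>", "</img>", "<eos>"

def generate_interleaved(text_head, visual_head, max_steps=50):
    """Emit tokens autoregressively, routing to the visual head
    between IMG_START and IMG_END, and to the text head otherwise."""
    output, mode = [], "text"
    for _ in range(max_steps):
        if mode == "text":
            tok = text_head(output)        # text decoding head
            output.append(tok)
            if tok == IMG_START:           # forward switch: text -> image
                mode = "image"
            elif tok == EOS:
                break
        else:
            tok = visual_head(output)      # visual decoding head
            output.append(tok)
            if tok == IMG_END:             # switch back: image -> text
                mode = "text"
    return output

def make_stub(tokens):
    """Deterministic stand-in for a decoder head (for illustration)."""
    it = iter(tokens)
    return lambda _context: next(it)

text_head = make_stub(["A", "cat", IMG_START, "sits", EOS])
visual_head = make_stub(["v1", "v2", IMG_END])
seq = generate_interleaved(text_head, visual_head)
# seq == ["A", "cat", "<img>", "v1", "v2", "</img>", "sits", "<eos>"]
```

Because both heads share one output sequence, text and visual tokens interleave naturally in a single autoregressive pass, which is the behavior the abstract attributes to the unified embedding space.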