ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
March 9, 2025
Authors: Jianwen Sun, Yukang Feng, Chuanhao Li, Fanrui Zhang, Zizhen Li, Jiaxin Ai, Sizhuo Zhou, Yu Dai, Shenglin Zhang, Kaipeng Zhang
cs.AI
Abstract
Unified models (UniMs) for multimodal understanding and generation have
recently received much attention in the vision-and-language community. Existing
UniMs are designed to learn multimodal understanding and generation
simultaneously, which demands substantial computational resources, and they
often struggle to generate interleaved text and images. We present ARMOR, a
resource-efficient and purely autoregressive framework that achieves both
understanding and generation by fine-tuning existing multimodal large language
models (MLLMs). Specifically, ARMOR extends existing MLLMs from three
perspectives: (1) For model architecture, an asymmetric encoder-decoder
architecture with a forward-switching mechanism is introduced to unify the
embedding spaces of the textual and visual modalities, enabling natural
interleaved text-image generation with minimal computational overhead. (2) For
training data, a meticulously curated, high-quality interleaved dataset is
collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a
"what or how to generate" algorithm that empowers existing MLLMs with
multimodal generation capabilities while preserving their multimodal
understanding capabilities, through three progressive training stages based on
the collected dataset. Experimental results demonstrate that ARMOR upgrades
existing MLLMs to UniMs with promising image generation capabilities using
limited training resources. Our code will be released soon at
https://armor.github.io.
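The abstract does not specify how the forward-switching mechanism works internally. As a rough illustration under assumed semantics only (at each decoding step the backbone's hidden state is routed to either a text head or an image head based on a learned modality switch), here is a toy sketch; all class names, parameter names, and sizes are hypothetical, not ARMOR's actual implementation:

```python
import torch
import torch.nn as nn

class ForwardSwitch(nn.Module):
    """Toy sketch of a forward-switching output layer: the model first
    predicts which modality to emit next, then routes the hidden state
    to the matching output head. Sizes and names are illustrative."""

    def __init__(self, hidden=64, text_vocab=100, image_vocab=256):
        super().__init__()
        self.switch = nn.Linear(hidden, 2)            # 0 = text, 1 = image
        self.text_head = nn.Linear(hidden, text_vocab)
        self.image_head = nn.Linear(hidden, image_vocab)

    def forward(self, h):
        # h: (batch, hidden) hidden state from the autoregressive backbone
        modality = self.switch(h).argmax(dim=-1)      # per-example routing decision
        text_logits = self.text_head(h)               # logits over text tokens
        image_logits = self.image_head(h)             # logits over image tokens
        return modality, text_logits, image_logits

switch = ForwardSwitch()
h = torch.randn(2, 64)
modality, t_logits, i_logits = switch(h)
```

In this sketch both heads share the backbone's hidden state, so adding generation costs only the extra output projections; the switch decision is what would let a single autoregressive pass interleave text and image tokens.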