

EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

December 4, 2025
Authors: Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian
cs.AI

Abstract

We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA is built on four core designs: 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation and, by applying the same compression ratio to images, keeps training balanced between the understanding and generation tasks. 2) Channel-wise concatenation instead of token-wise concatenation of visual understanding and generation tokens, which further reduces the number of visual tokens in the unified architecture. 3) A shared-and-decoupled network that enables mutual improvement across tasks while meeting task-specific modeling requirements. 4) A mixture-of-experts mechanism in the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters. Extensive experiments show that EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while achieving competitive results against recent specialized multimodal understanding and generation models (e.g., Qwen3-VL and Qwen-Image). We believe that EMMA lays a solid foundation for the future development of unified multimodal architectures.
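To make designs 1) and 2) concrete, below is a minimal sketch, not the authors' implementation, of how a 32x-compression autoencoder shrinks the generation token count and how channel-wise concatenation keeps the sequence length fixed compared to token-wise concatenation. All shapes, names, and the final projection layer are illustrative assumptions.

```python
# Illustrative sketch only; shapes and module names are assumptions,
# not EMMA's actual code.
import torch

# (1) Token count under different spatial compression ratios.
# A 1024x1024 image at a 32x ratio yields a 32x32 latent grid (1,024 tokens),
# versus 64x64 = 4,096 tokens at a more common 16x ratio.
def num_tokens(image_size: int, compression: int) -> int:
    side = image_size // compression
    return side * side

print(num_tokens(1024, 16))  # 4096
print(num_tokens(1024, 32))  # 1024

# (2) Combining understanding and generation tokens.
B, N, C = 2, 1024, 1536                  # batch, tokens per stream, channels (assumed)
und_tokens = torch.randn(B, N, C)        # from the visual understanding encoder
gen_tokens = torch.randn(B, N, C)        # from the generation autoencoder

# Token-wise concatenation doubles the sequence length the model attends over.
token_wise = torch.cat([und_tokens, gen_tokens], dim=1)     # (B, 2N, C)

# Channel-wise concatenation keeps the sequence length at N and widens each
# token; an assumed linear projection maps it back to the model width.
channel_wise = torch.cat([und_tokens, gen_tokens], dim=-1)  # (B, N, 2C)
proj = torch.nn.Linear(2 * C, C)
fused = proj(channel_wise)                                  # (B, N, C)

print(token_wise.shape, fused.shape)
```

Under these assumed shapes, the channel-wise variant halves the visual sequence length seen by the unified backbone relative to token-wise concatenation, which is the efficiency effect the abstract describes.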