EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture

December 4, 2025
Authors: Xin He, Longhui Wei, Jianbo Ouyang, Lingxi Xie, Qi Tian
cs.AI

Abstract

We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA rests on four core designs: 1) an efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation and, by applying the same compression ratio to images across tasks, keeps training balanced between understanding and generation; 2) channel-wise rather than token-wise concatenation of visual understanding and generation tokens, which further reduces the number of visual tokens in the unified architecture; 3) a shared-and-decoupled network that enables mutual improvement across tasks while meeting task-specific modeling requirements; and 4) a mixture-of-experts mechanism in the visual understanding encoder, which substantially improves perceptual capability with only a small increase in parameters. Extensive experiments show that EMMA-4B significantly outperforms state-of-the-art unified multimodal approaches (e.g., BAGEL-7B) in both efficiency and performance, while achieving competitive results against recent specialized multimodal understanding and generation models (e.g., Qwen3-VL and Qwen-Image). We believe EMMA lays a solid foundation for the future development of unified multimodal architectures.
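
To make the token-count claims concrete: if the 32x compression ratio in design 1) denotes per-side spatial downsampling, a 512x512 image maps to a 16x16 grid of 256 latent tokens, versus 1,024 tokens at 16x. The sketch below contrasts the token-wise and channel-wise concatenation of design 2) in PyTorch; the tensor shapes and the fusion projection are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative shapes only; EMMA's actual dimensions are not given in the abstract.
B, N, C = 2, 256, 1024             # batch, visual tokens per image, channel width

und_tokens = torch.randn(B, N, C)  # tokens from the understanding encoder
gen_tokens = torch.randn(B, N, C)  # latent tokens from the 32x autoencoder

# Token-wise concatenation: the sequence length doubles, so the LLM backbone
# must attend over 2N visual tokens per image.
token_wise = torch.cat([und_tokens, gen_tokens], dim=1)     # (B, 2N, C)

# Channel-wise concatenation: the sequence length stays N; a (hypothetical)
# linear projection folds the widened channels back to the backbone width.
channel_wise = torch.cat([und_tokens, gen_tokens], dim=-1)  # (B, N, 2C)
fuse = nn.Linear(2 * C, C)
fused = fuse(channel_wise)                                  # (B, N, C)

print(token_wise.shape)  # torch.Size([2, 512, 1024])
print(fused.shape)       # torch.Size([2, 256, 1024])
```

Since self-attention cost grows quadratically with sequence length, halving the visual token count in this way roughly quarters the attention cost of the visual portion of the sequence, which is the efficiency argument behind design 2).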