Ming-Omni: A Unified Multimodal Model for Perception and Generation
June 11, 2025
Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, Lyuxin Xue, Lan Wang, Mochen Bai, Ning Gao, Pei Chen, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Sirui Gao, Tinghao Liu, Taisong Li, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaoxue Chen, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yunxiao Sun, Yipeng Chen, Yifei Wu, Yongjie Lyu, Ziping Ma, Zipeng Feng, Zhijiang Fang, Zhihao Qiu, Ziyuan Huang, Zhengyu He
cs.AI
Abstract
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby supporting diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allows the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-Omni is, to our knowledge, the first open-source model to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
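To make the routing idea concrete, the sketch below shows a minimal MoE layer in which each modality (text, image, audio, video) has its own router over a shared pool of expert FFNs, loosely following the abstract's description of Ling. This is an illustrative assumption written in PyTorch, not the authors' implementation; all class names, dimensions, and hyperparameters are hypothetical.

```python
# Hypothetical sketch of a modality-specific MoE router; not the Ling implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityRoutedMoE(nn.Module):
    def __init__(self, dim=1024, num_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        # Shared pool of expert feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # One router per modality, so routing decisions can specialize per input type.
        self.routers = nn.ModuleDict({m: nn.Linear(dim, num_experts) for m in modalities})

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); `modality` selects which router scores the experts.
        logits = self.routers[modality](tokens)              # (B, S, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            idx = indices[..., k]                            # (B, S) chosen expert ids
            w = weights[..., k].unsqueeze(-1)                # (B, S, 1) gating weights
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)              # tokens routed to expert e
                if mask.any():
                    out = out + mask * w * expert(tokens)
        return out

# Usage example: image tokens are dispatched through the image-specific router.
moe = ModalityRoutedMoE()
img_tokens = torch.randn(2, 16, 1024)
fused = moe(img_tokens, modality="image")
```

The point of the per-modality routers is that tokens from different encoders can be gated to different experts without separate models or task-specific fine-tuning; the expert pool itself stays shared across modalities.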