

Ming-Omni: A Unified Multimodal Model for Perception and Generation

June 11, 2025
Authors: Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, Furong Xu, GuangMing Yao, Jun Zhou, Jingdong Chen, Jianxin Sun, Jiajia Liu, Jianjiang Zhu, Jun Peng, Kaixiang Ji, Kaiyou Song, Kaimeng Ren, Libin Wang, Lixiang Ru, Lele Xie, Longhua Tan, Lyuxin Xue, Lan Wang, Mochen Bai, Ning Gao, Pei Chen, Qingpei Guo, Qinglong Zhang, Qiang Xu, Rui Liu, Ruijie Xiong, Sirui Gao, Tinghao Liu, Taisong Li, Weilong Chai, Xinyu Xiao, Xiaomei Wang, Xiaoxue Chen, Xiao Lu, Xiaoyu Li, Xingning Dong, Xuzheng Yu, Yi Yuan, Yuting Gao, Yunxiao Sun, Yipeng Chen, Yifei Wu, Yongjie Lyu, Ziping Ma, Zipeng Feng, Zhijiang Fang, Zhihao Qiu, Ziyuan Huang, Zhengyu He
cs.AI

Abstract

We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allows the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
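The abstract describes per-modality encoders whose tokens are fused by Ling, an MoE backbone with modality-specific routers. The following is a minimal, hypothetical PyTorch sketch of modality-specific routing over a shared expert pool; the class and parameter names, dimensions, and top-k gating scheme are illustrative assumptions, not the released Ming-Omni/Ling implementation.

```python
# Hypothetical sketch: a shared pool of expert FFNs with one router per modality,
# so tokens from text, image, audio, or video are gated by a modality-specific
# gating network. All names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalitySpecificMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=8, top_k=2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        # Shared pool of expert feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        # One gating network (router) per modality.
        self.routers = nn.ModuleDict(
            {m: nn.Linear(d_model, n_experts) for m in modalities}
        )

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model); modality selects the router.
        logits = self.routers[modality](tokens)          # (B, S, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                # tokens sent to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(tokens[mask])
        return out


# Usage: image and text tokens share the expert pool but use separate routers.
layer = ModalitySpecificMoE()
img_out = layer(torch.randn(2, 16, 512), "image")
txt_out = layer(torch.randn(2, 16, 512), "text")
```

In this sketch, keeping the experts shared while specializing only the routers is one plausible way a single MoE backbone could fuse heterogeneous modalities without per-task models, in the spirit of the design the abstract outlines.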