

MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

September 29, 2025
Authors: Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, Jiaya Jia
cs.AI

Abstract

We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the gap between text and speech token rates, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalized long-horizon speech generation.
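To make the "brain-mouth" decoupling and the chunk-based parallel decoding concrete, here is a minimal sketch of the general idea. It is not the authors' implementation: the interfaces (`TextBrain`-style `brain.step`, `mouth.decode_chunk`, `chunk_size`, the speaker embedding argument) are illustrative assumptions; the point is only that one low-rate text step conditions a whole chunk of speech tokens decoded in parallel, so the speech stream keeps pace with real time.

```python
# Hypothetical sketch of dual-track, chunk-based parallel decoding.
# "brain" = autoregressive text LLM; "mouth" = speech-token decoder.
# All names and signatures are assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class SpeechChunk:
    speech_tokens: List[int]   # codec/acoustic tokens emitted for this chunk
    text_token: int            # text token that conditioned the chunk


def stream_speech(brain, mouth, prompt_tokens: List[int],
                  speaker_embedding, chunk_size: int = 8,
                  max_text_tokens: int = 256) -> Iterator[SpeechChunk]:
    """Interleave low-rate text decoding with chunked speech decoding.

    brain.step(text_ctx)                       -> next text token (int)
    mouth.decode_chunk(text_ctx, speech_ctx,
                       spk, n_tokens)          -> list of n_tokens speech tokens
    """
    text_ctx = list(prompt_tokens)
    speech_ctx: List[int] = []

    for _ in range(max_text_tokens):
        t = brain.step(text_ctx)          # one "reasoning" step on the text track
        if t == brain.eos_id:
            break
        text_ctx.append(t)

        # For each text token, the mouth predicts a whole chunk of speech tokens
        # in one parallel pass, narrowing the text/speech token-rate gap and
        # allowing the audio to be streamed out chunk by chunk.
        chunk = mouth.decode_chunk(text_ctx, speech_ctx, speaker_embedding,
                                   n_tokens=chunk_size)
        speech_ctx.extend(chunk)
        yield SpeechChunk(speech_tokens=chunk, text_token=t)
```

Under these assumptions, conditioning every chunk on a fixed speaker embedding (e.g. from a short reference clip) is one way such a scheme could support streaming zero-shot voice cloning with a stable timbre over long outputs.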