

MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

September 29, 2025
Authors: Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, Jiaya Jia
cs.AI

Abstract

We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the gap between text and speech token rates, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalized long-horizon speech generation.
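To make the "brain-mouth" decoupling and the chunk-based parallel decoding concrete, here is a minimal sketch of the general idea. It is not the authors' implementation: the interfaces (`TextBrain`-style `brain.step`, `mouth.decode_chunk`, `chunk_size`, the speaker embedding argument) are illustrative assumptions; the point is only that one low-rate text step conditions a whole chunk of speech tokens decoded in parallel, so the speech stream keeps pace with real time.

```python
# Hypothetical sketch of dual-track, chunk-based parallel decoding.
# "brain" = autoregressive text LLM; "mouth" = speech-token decoder.
# All names and signatures are assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class SpeechChunk:
    speech_tokens: List[int]   # codec/acoustic tokens emitted for this chunk
    text_token: int            # text token that conditioned the chunk


def stream_speech(brain, mouth, prompt_tokens: List[int],
                  speaker_embedding, chunk_size: int = 8,
                  max_text_tokens: int = 256) -> Iterator[SpeechChunk]:
    """Interleave low-rate text decoding with chunked speech decoding.

    brain.step(text_ctx)                       -> next text token (int)
    mouth.decode_chunk(text_ctx, speech_ctx,
                       spk, n_tokens)          -> list of n_tokens speech tokens
    """
    text_ctx = list(prompt_tokens)
    speech_ctx: List[int] = []

    for _ in range(max_text_tokens):
        t = brain.step(text_ctx)          # one "reasoning" step on the text track
        if t == brain.eos_id:
            break
        text_ctx.append(t)

        # For each text token, the mouth predicts a whole chunk of speech tokens
        # in one parallel pass, narrowing the text/speech token-rate gap and
        # allowing the audio to be streamed out chunk by chunk.
        chunk = mouth.decode_chunk(text_ctx, speech_ctx, speaker_embedding,
                                   n_tokens=chunk_size)
        speech_ctx.extend(chunk)
        yield SpeechChunk(speech_tokens=chunk, text_token=t)
```

Under these assumptions, conditioning every chunk on a fixed speaker embedding (e.g. from a short reference clip) is one way such a scheme could support streaming zero-shot voice cloning with a stable timbre over long outputs.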