MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
September 29, 2025
Authors: Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, Jiaya Jia
cs.AI
Abstract
We present MGM-Omni, a unified Omni LLM for omni-modal understanding and
expressive, long-horizon speech generation. Unlike cascaded pipelines that
isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a
dual-track, token-based architecture that cleanly decouples multimodal
reasoning from real-time speech generation. This design enables efficient
cross-modal interaction and low-latency, streaming speech generation. For
understanding, a unified training strategy coupled with a dual audio encoder
design enables long-form audio perception across diverse acoustic conditions.
For generation, a chunk-based parallel decoding scheme narrows the gap between
text and speech token rates, accelerating inference and supporting streaming zero-shot voice
cloning with stable timbre over extended durations. Compared to concurrent
work, MGM-Omni achieves these capabilities with markedly data-efficient
training. Extensive experiments demonstrate that MGM-Omni outperforms existing
open-source models in preserving timbre identity across extended sequences,
producing natural and context-aware speech, and achieving superior long-form
audio and omni-modal understanding. MGM-Omni establishes an efficient,
end-to-end paradigm for omni-modal understanding and controllable, personalized
long-horizon speech generation.
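
To make the chunk-based parallel decoding idea concrete, the toy sketch below contrasts one-token-at-a-time decoding with emitting a fixed-size chunk of speech tokens per decoder step. The token rates, chunk size, and the `decode_chunk` helper are illustrative assumptions for this sketch, not MGM-Omni's actual interface.

```python
# Minimal sketch of chunk-based parallel decoding for speech tokens.
# All names and rates below (TEXT_TOKENS_PER_SECOND, SPEECH_TOKENS_PER_SECOND,
# decode_chunk, chunk_size) are illustrative assumptions, not MGM-Omni's API.
import random

TEXT_TOKENS_PER_SECOND = 3      # assumed rate of text tokens from the "brain"
SPEECH_TOKENS_PER_SECOND = 50   # assumed rate of discrete speech tokens


def decode_chunk(context, chunk_size):
    """Stand-in for the "mouth" decoder: emit `chunk_size` speech tokens
    in one parallel step, conditioned on the tokens generated so far."""
    return [random.randrange(1024) for _ in range(chunk_size)]


def generate_speech(num_text_tokens, chunk_size):
    """Stream speech tokens chunk by chunk and count decoder steps."""
    target = num_text_tokens * SPEECH_TOKENS_PER_SECOND // TEXT_TOKENS_PER_SECOND
    speech, steps = [], 0
    while len(speech) < target:
        speech.extend(decode_chunk(speech, chunk_size))
        steps += 1
    return speech[:target], steps


if __name__ == "__main__":
    for k in (1, 8):  # k=1: token-by-token baseline; k=8: chunked parallel decoding
        tokens, steps = generate_speech(num_text_tokens=30, chunk_size=k)
        print(f"chunk_size={k}: {len(tokens)} speech tokens in {steps} decoder steps")
```

With a chunk size of 8, the decoder needs roughly 8x fewer steps to keep pace with the much higher speech-token rate, which is the mechanism the abstract credits for faster inference and low-latency streaming.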