MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech

Abstract

We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the gap between text and speech token rates, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open-source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omni-modal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omni-modal understanding and controllable, personalized long-horizon speech generation.
