
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

October 15, 2025
Authors: Zhenyu Liu, Yunxin Li, Xuanyu Zhang, Qixun Teng, Shenyuan Jiang, Xinyu Chen, Haoyuan Shi, Jinchao Li, Qi Wang, Haolan Chen, Fanbo Meng, Mingjun Zhao, Yu Xu, Yancheng He, Baotian Hu, Min Zhang
cs.AI

Abstract

Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model built on a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert-number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages the original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert on a subset of the balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architectures and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
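
For concreteness, the sketch below illustrates how a Top-P expert router of the kind the abstract describes might work: rather than picking a fixed top-k, it selects, per token, the smallest set of experts whose cumulative router probability exceeds a threshold. This is an illustrative reconstruction, not the paper's released code; the threshold value, tensor shapes, and function names are assumptions.

```python
import torch
import torch.nn.functional as F


def top_p_route(router_logits: torch.Tensor, p: float = 0.7):
    """Per token, select the smallest expert set whose cumulative
    router probability exceeds the threshold p.

    router_logits: (num_tokens, num_experts) scores over the routed
    and null experts together; p and the shapes are illustrative
    assumptions, not values from the paper.
    Returns a boolean selection mask and renormalized routing weights.
    """
    probs = F.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Keep an expert if the probability mass accumulated *before* it
    # is still below p; the shift guarantees at least one expert is
    # kept for every token.
    keep_sorted = (cumulative - sorted_probs) < p
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(-1, sorted_idx, keep_sorted)
    # Renormalize the surviving probabilities into routing weights.
    weights = torch.where(mask, probs, torch.zeros_like(probs))
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return mask, weights


# Toy usage: 4 tokens routed over 6 experts, with the last index
# standing in for a null expert whose output is defined as zero.
logits = torch.randn(4, 6)
mask, weights = top_p_route(logits, p=0.7)
print(mask.sum(dim=-1))  # the number of active experts varies per token
```

Under this reading, a token whose routing mass falls largely on a null expert contributes (near-)zero from that slot, which is one plausible way to realize the "adaptive computation skipping" mentioned above, while a shared expert would be applied to every token unconditionally and added to the weighted mixture of the selected routed experts.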