UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE
October 15, 2025
Authors: Zhenyu Liu, Yunxin Li, Xuanyu Zhang, Qixun Teng, Shenyuan Jiang, Xinyu Chen, Haoyuan Shi, Jinchao Li, Qi Wang, Haolan Chen, Fanbo Meng, Mingjun Zhao, Yu Xu, Yancheng He, Baotian Hu, Min Zhang
cs.AI
Abstract
Recent advances in unified multimodal models indicate a clear trend towards
comprehensive content generation. However, the auditory domain remains a
significant challenge, with music and speech often developed in isolation,
hindering progress towards universal audio synthesis. This separation stems
from inherent task conflicts and severe data imbalances, which impede the
development of a truly unified audio generation model. To address this
challenge, we propose UniMoE-Audio, a unified speech and music generation model
within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework.
Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic
expert number allocation, and a hybrid expert design comprising routed experts
for domain-specific knowledge, shared experts for domain-agnostic features, and
null experts for adaptive computation skipping. To tackle data imbalance, we
introduce a three-stage training curriculum: 1) Independent Specialist Training
leverages original datasets to instill domain-specific knowledge into each
"proto-expert" without interference; 2) MoE Integration and Warmup incorporates
these specialists into the UniMoE-Audio architecture, warming up the gate
module and shared expert using a subset of the balanced dataset; and 3) Synergistic
Joint Training trains the entire model end-to-end on the fully balanced
dataset, fostering enhanced cross-domain synergy. Extensive experiments show
that UniMoE-Audio not only achieves state-of-the-art performance on major
speech and music generation benchmarks, but also demonstrates superior
synergistic learning, mitigating the performance degradation typically seen in
naive joint training. Our findings highlight the substantial potential of
specialized MoE architectures and curated training strategies in advancing the
field of universal audio generation. Homepage:
https://mukioxun.github.io/Uni-MoE-site/home.html
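The abstract's core architectural idea, Top-P routing with a dynamic number of active experts, can be illustrated with a short sketch. This is a minimal, hypothetical implementation assuming standard nucleus-style routing over a softmax of gate logits; the function name `top_p_route`, the threshold `p=0.7`, and the cap `max_experts` are illustrative choices, not values taken from the paper.

```python
import torch

def top_p_route(logits: torch.Tensor, p: float = 0.7, max_experts: int = 4):
    """Select, per token, the smallest set of experts whose cumulative
    routing probability reaches p (a sketch of dynamic-capacity Top-P
    routing; hyperparameters are illustrative, not the paper's)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    # Keep an expert if the cumulative mass *before* it is still below p;
    # always keep at least one, and never exceed the capacity cap.
    keep = (cum - sorted_probs) < p
    keep[..., :1] = True
    keep[..., max_experts:] = False
    # Renormalize the surviving routing weights so they sum to 1 per token.
    weights = sorted_probs * keep
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return sorted_idx, weights, keep

# A token with a peaked gate distribution activates fewer experts than a
# token with a flat one -- the "dynamic expert number allocation" above.
logits = torch.tensor([[4.0, 0.1, 0.0, -1.0],   # peaked -> 1 expert
                       [1.0, 0.9, 0.8, 0.7]])   # flat   -> 3 experts
idx, w, keep = top_p_route(logits, p=0.7)
print(keep.sum(dim=-1))
```

Under this scheme, the null experts described in the abstract would simply be entries in the gate's output whose selection maps to an identity (skip) computation, so a token routed mostly to null experts incurs little compute.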