

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

November 16, 2025
Authors: Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang
cs.AI

Abstract

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data-matching technique. The model is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability across 10 cross-modal input types using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layers. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech- and image-generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on more than 50 of 76 benchmarks. Key strengths include video understanding (+7% avg. over 8 benchmarks), omnimodal understanding (+7% avg. over 4 benchmarks), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
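The abstract names a shared/routed/null expert design but gives no implementation details. The sketch below is a minimal, hypothetical PyTorch illustration of how such a layer could be wired, assuming shared experts run on every token, a router distributes top-k gate mass over routed experts plus null slots, and a null slot contributes nothing so that a token can skip extra computation. The class and parameter names (OmniMoELayer, num_routed, num_null, top_k) and the dense dispatch are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of an MoE layer combining shared, routed, and null experts.
# Shared experts process every token; the router selects top-k slots among the
# routed experts and "null" slots, which add nothing beyond the residual path,
# letting easy tokens spend less compute. Names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForwardExpert(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)


class OmniMoELayer(nn.Module):
    def __init__(self, d_model=256, d_hidden=512,
                 num_shared=1, num_routed=4, num_null=2, top_k=2):
        super().__init__()
        self.shared = nn.ModuleList(
            FeedForwardExpert(d_model, d_hidden) for _ in range(num_shared))
        self.routed = nn.ModuleList(
            FeedForwardExpert(d_model, d_hidden) for _ in range(num_routed))
        self.num_routed, self.top_k = num_routed, top_k
        # Router scores routed experts plus null slots (a null slot adds zero).
        self.router = nn.Linear(d_model, num_routed + num_null)

    def forward(self, x):  # x: (batch, seq, d_model)
        out = sum(expert(x) for expert in self.shared)      # shared: always active
        gate_logits = self.router(x)                        # (B, S, routed + null)
        weights, idx = gate_logits.topk(self.top_k, dim=-1) # pick top-k slots
        weights = F.softmax(weights, dim=-1)
        for k in range(self.top_k):
            slot, w = idx[..., k], weights[..., k:k + 1]
            for e, expert in enumerate(self.routed):
                mask = (slot == e).unsqueeze(-1)            # tokens routed to e
                if mask.any():
                    out = out + mask * w * expert(x)
            # Slots with index >= num_routed are null experts: nothing is added.
        return x + out                                      # residual connection


if __name__ == "__main__":
    layer = OmniMoELayer()
    tokens = torch.randn(2, 8, 256)
    print(layer(tokens).shape)  # torch.Size([2, 8, 256])
```

A production MoE layer would dispatch each token only to its selected experts rather than evaluating every expert densely as above; the dense form is used here only to keep the sketch short and self-contained.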