

Uni-MoE-2.0-Omni: Scaling Language-Centric Omnimodal Large Model with Advanced MoE, Training and Data

November 16, 2025
作者: Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, Baotian Hu, Min Zhang
cs.AI

Abstract

We present Uni-MoE 2.0 from the Lychee family. As a fully open-source omnimodal large model (OLM), it substantially advances Lychee's Uni-MoE series in language-centric multimodal understanding, reasoning, and generation. Based on the Qwen2.5-7B dense architecture, we build Uni-MoE-2.0-Omni from scratch through three core contributions: a dynamic-capacity Mixture-of-Experts (MoE) design, a progressive training strategy enhanced with an iterative reinforcement strategy, and a carefully curated multimodal data matching technique. It is capable of omnimodal understanding, as well as generating images, text, and speech. Architecturally, our new MoE framework balances computational efficiency and capability for 10 cross-modal inputs using shared, routed, and null experts, while our Omni-Modality 3D RoPE ensures spatio-temporal cross-modality alignment in the self-attention layer. For training, following cross-modal pretraining, we use a progressive supervised fine-tuning strategy that activates modality-specific experts and is enhanced by balanced data composition and an iterative GSPO-DPO method to stabilise RL training and improve reasoning. Data-wise, the base model, trained on approximately 75B tokens of open-source multimodal data, is equipped with special speech and image generation tokens, allowing it to learn these generative tasks by conditioning its outputs on linguistic cues. Extensive evaluation across 85 benchmarks demonstrates that our model achieves SOTA or highly competitive performance against leading OLMs, surpassing Qwen2.5-Omni (trained with 1.2T tokens) on over 50 of 76 benchmarks. Key strengths include video understanding (+7% average over 8 tasks), omnimodal understanding (+7% average over 4 tasks), and audiovisual reasoning (+4%). It also advances long-form speech processing (reducing WER by 4.2%) and leads in low-level image processing and controllable generation across 5 metrics.
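
The abstract's mention of shared, routed, and null experts lends itself to a short illustration. The sketch below is a minimal PyTorch example, not the paper's implementation: the module names, layer sizes, top-k routing, and the treatment of null experts as zero-output slots are all assumptions made for clarity. It shows one way a router over routed plus null slots yields dynamic per-token capacity, while shared experts stay always active.

```python
# Minimal sketch (assumed details, not the paper's code) of an MoE block with
# shared, routed, and null experts. Null slots produce no output, so a token
# that routes to them consumes fewer active experts (dynamic capacity).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNExpert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)


class DynamicCapacityMoE(nn.Module):
    """Shared experts always run; the router scores routed and null slots
    jointly and keeps the top-k. Hitting a null slot adds nothing."""

    def __init__(self, d_model=1024, d_ff=4096,
                 n_shared=1, n_routed=8, n_null=4, top_k=2):
        super().__init__()
        self.shared = nn.ModuleList(FFNExpert(d_model, d_ff) for _ in range(n_shared))
        self.routed = nn.ModuleList(FFNExpert(d_model, d_ff) for _ in range(n_routed))
        self.n_routed, self.top_k = n_routed, top_k
        self.router = nn.Linear(d_model, n_routed + n_null)

    def forward(self, x):                        # x: (batch, seq, d_model)
        out = sum(e(x) for e in self.shared)     # shared experts: always active
        gates = F.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.top_k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)
        for slot in range(self.top_k):
            idx, w = topi[..., slot], topv[..., slot:slot + 1]
            # Slots with idx >= n_routed are null experts and are skipped.
            for e in range(self.n_routed):
                mask = (idx == e).unsqueeze(-1).float()
                if mask.any():
                    # Unoptimized: runs the expert on all tokens, then masks.
                    out = out + w * mask * self.routed[e](x)
        return out
```

For instance, `DynamicCapacityMoE()(torch.randn(2, 16, 1024))` returns a tensor of the same shape. A real system would add load-balancing objectives, token dispatch for efficiency, and modality-aware expert activation, none of which are specified by the abstract and are therefore omitted here.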