Scaling Embeddings Outperforms Scaling Experts in Language Models
January 29, 2026
Authors: Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai
cs.AI
Abstract
While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B-parameter model with ~3B activated parameters, trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
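To make the sparsity argument concrete, here is a minimal sketch of the parameter accounting: each token reads only a single row of the embedding table, so a very large embedding grows total parameters while leaving per-token activated parameters nearly unchanged, analogous to routing to a single expert. All numbers and names below (`param_breakdown`, `vocab_size`, `d_model`, `dense_params`) are illustrative assumptions introduced here, not the paper's actual configuration.

```python
# Illustrative sketch (hypothetical numbers, not LongCat-Flash-Lite's real config):
# why scaling the embedding table increases *total* parameters while barely
# changing *activated* parameters per token -- each token looks up one row.

def param_breakdown(vocab_size: int, d_model: int, dense_params: float) -> dict:
    """Return total vs. per-token activated parameter counts, in billions.

    dense_params counts all non-embedding weights every token passes through.
    """
    embedding_params = vocab_size * d_model   # full lookup table, stored but sparse
    activated_embedding = d_model             # only one embedding row read per token
    return {
        "total_B": (embedding_params + dense_params) / 1e9,
        "activated_per_token_B": (activated_embedding + dense_params) / 1e9,
    }

# Hypothetical example: a very large vocabulary pushes the embedding table past
# 30B parameters, yet activated parameters per token stay close to the ~3B backbone.
print(param_breakdown(vocab_size=10_000_000, d_model=3_072, dense_params=3e9))
# -> {'total_B': ~33.7, 'activated_per_token_B': ~3.0}
```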