Scaling Embeddings Outperforms Scaling Experts in Language Models
January 29, 2026
Authors: Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai
cs.AI
Abstract
While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy, ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B-parameter model with ~3B activated parameters, trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.
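To make the sparsity argument concrete, below is a minimal, illustrative parameter-accounting sketch contrasting a large token-embedding table with an MoE feed-forward block: an embedding table can contribute many total parameters while activating only a single row per token. All configuration numbers, function names, and the parameter split here are hypothetical and are not taken from LongCat-Flash-Lite (the abstract states only ~68.5B total parameters, ~3B activated, and over 30B allocated to embeddings).

```python
# Toy parameter accounting: total vs. activated parameters per token.
# All numbers below are illustrative assumptions, not the paper's configuration.

def embedding_params(vocab_size: int, dim: int) -> int:
    """Total parameters in a token-embedding table of shape (vocab_size, dim)."""
    return vocab_size * dim

def moe_ffn_params(dim: int, ffn_dim: int, num_experts: int) -> int:
    """Total parameters in a gated MoE FFN (up/gate/down projections per expert)."""
    return num_experts * 3 * dim * ffn_dim

# Hypothetical configuration for illustration only.
dim, ffn_dim = 2048, 8192
vocab, experts, top_k = 2_000_000, 64, 2  # a very large vocabulary vs. many experts

emb_total = embedding_params(vocab, dim)
moe_total = moe_ffn_params(dim, ffn_dim, experts)

# Activated parameters per token: one embedding row vs. top_k expert FFNs.
emb_active = dim                        # a single table row is read per token
moe_active = top_k * 3 * dim * ffn_dim  # top_k expert FFNs run per token

print(f"embedding: total={emb_total/1e9:.2f}B, activated/token={emb_active/1e6:.4f}M")
print(f"MoE FFN  : total={moe_total/1e9:.2f}B, activated/token={moe_active/1e6:.2f}M")
```

Under this toy configuration, the embedding table holds roughly 4B parameters yet reads only one 2048-dimensional row per token, while the MoE block activates about 100M parameters per token; this is the sense in which scaling embeddings adds total capacity along a sparsity dimension orthogonal to adding experts.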