Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
June 23, 2025
Authors: Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
cs.AI
Abstract
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE)
architecture that introduces sequential expert communication within each layer.
Unlike traditional MoE models, where experts operate independently in parallel,
CoE processes tokens iteratively across a chain of experts inside a layer. To
support dynamic expert selection across iterations, CoE employs a dedicated
router at each iteration step within a layer. This design allows tokens to
re-evaluate and select different experts during each iteration, rather than
being statically assigned. As a result, CoE introduces a flexible routing
mechanism that increases the diversity of expert combinations and enriches the
model's representational capacity. CoE demonstrates improved performance under
fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to
1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling
axis: depth through expert iteration, which complements conventional
width/depth scaling. For example, using 2x iterations matches the performance
of 3x expert selections (in width), while reducing memory usage by 17.6-42%
relative to other scaling strategies. Our analysis reveals that CoE's benefits
stem from its iterative residual structure and enhanced expert specialization
empowered by iterative routing, which together unlock more expressive
representations. Code is available at https://github.com/ZihanWang314/coe.
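The mechanism described in the abstract — per-iteration routers and a residual chain of experts inside one layer — can be illustrated with a minimal NumPy sketch. This is not the released implementation (see the repository linked above); the class name, linear single-matrix experts, and renormalized top-k gating are simplifying assumptions chosen for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


class CoELayerSketch:
    """Illustrative Chain-of-Experts layer (not the paper's code).

    Each of the `n_iter` iterations has its own dedicated router, so a
    token can be routed to different experts at each step; the token
    representation is refined residually across iterations.
    """

    def __init__(self, d_model, n_experts, n_iter, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.n_iter, self.top_k = n_iter, top_k
        # One router weight matrix per iteration (dedicated routers).
        self.routers = [rng.standard_normal((d_model, n_experts)) * 0.02
                        for _ in range(n_iter)]
        # Shared expert pool; each expert is a single linear map here.
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]

    def __call__(self, x):
        # x: (n_tokens, d_model)
        for t in range(self.n_iter):
            probs = softmax(x @ self.routers[t])       # (n_tokens, n_experts)
            topk = np.argsort(-probs, axis=-1)[:, :self.top_k]
            out = np.zeros_like(x)
            for i in range(x.shape[0]):
                for e in topk[i]:
                    # Weight each selected expert by its renormalized gate.
                    gate = probs[i, e] / probs[i, topk[i]].sum()
                    out[i] += gate * (x[i] @ self.experts[e])
            x = x + out                                # residual across iterations
        return x


layer = CoELayerSketch(d_model=8, n_experts=4, n_iter=2, top_k=2)
y = layer(np.ones((3, 8)))
```

Doubling `n_iter` is the "depth through expert iteration" axis the abstract refers to: it reuses the same expert parameters across steps, which is why it can trade off against widening `top_k` at lower memory cost.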