Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
June 23, 2025
Authors: Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
cs.AI
Abstract
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE)
architecture that introduces sequential expert communication within each layer.
Unlike traditional MoE models, where experts operate independently in parallel,
CoE processes tokens iteratively across a chain of experts inside a layer. To
support dynamic expert selection across iterations, CoE employs a dedicated
router at each iteration step within a layer. This design allows tokens to
re-evaluate and select different experts during each iteration, rather than
being statically assigned. As a result, CoE introduces a flexible routing
mechanism that increases the diversity of expert combinations and enriches the
model's representational capacity. CoE demonstrates improved performance under
fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to
1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling
axis: depth through expert iteration, which complements conventional
width/depth scaling. For example, using 2x iterations matches the performance
of 3x expert selections (in width), while reducing memory usage by 17.6-42%
relative to other scaling strategies. Our analysis reveals that CoE's benefits
stem from its iterative residual structure and enhanced expert specialization
empowered by iterative routing, which together unlock more expressive
representations. Code is available at https://github.com/ZihanWang314/coe.
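The mechanism described in the abstract — per-iteration routers and a residual chain of experts inside one layer — can be illustrated with a minimal NumPy sketch. This is not the released implementation (see the repository linked above); the class name, linear single-matrix experts, and renormalized top-k gating are simplifying assumptions chosen for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


class CoELayerSketch:
    """Illustrative Chain-of-Experts layer (not the paper's code).

    Each of the `n_iter` iterations has its own dedicated router, so a
    token can be routed to different experts at each step; the token
    representation is refined residually across iterations.
    """

    def __init__(self, d_model, n_experts, n_iter, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.n_iter, self.top_k = n_iter, top_k
        # One router weight matrix per iteration (dedicated routers).
        self.routers = [rng.standard_normal((d_model, n_experts)) * 0.02
                        for _ in range(n_iter)]
        # Shared expert pool; each expert is a single linear map here.
        self.experts = [rng.standard_normal((d_model, d_model)) * 0.02
                        for _ in range(n_experts)]

    def __call__(self, x):
        # x: (n_tokens, d_model)
        for t in range(self.n_iter):
            probs = softmax(x @ self.routers[t])       # (n_tokens, n_experts)
            topk = np.argsort(-probs, axis=-1)[:, :self.top_k]
            out = np.zeros_like(x)
            for i in range(x.shape[0]):
                for e in topk[i]:
                    # Weight each selected expert by its renormalized gate.
                    gate = probs[i, e] / probs[i, topk[i]].sum()
                    out[i] += gate * (x[i] @ self.experts[e])
            x = x + out                                # residual across iterations
        return x


layer = CoELayerSketch(d_model=8, n_experts=4, n_iter=2, top_k=2)
y = layer(np.ones((3, 8)))
```

Doubling `n_iter` is the "depth through expert iteration" axis the abstract refers to: it reuses the same expert parameters across steps, which is why it can trade off against widening `top_k` at lower memory cost.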