
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

June 23, 2025
作者: Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
cs.AI

Abstract

We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
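To make the layer-internal mechanism concrete, here is a toy NumPy sketch of the routing loop the abstract describes: tokens pass through the expert pool for several iterations, a dedicated router re-selects top-k experts at every iteration, and each iteration adds a residual connection. This is an illustrative simplification under assumed details, not the authors' implementation; in particular, each "expert" is reduced to a single ReLU projection, and all names (`coe_layer`, `n_iters`, `top_k`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_iters = 8, 4, 2, 2

# Toy expert FFNs: each reduced to one ReLU projection (a simplification).
experts = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_experts)]
# One dedicated router per iteration step, as described in the abstract.
routers = [rng.normal(scale=0.1, size=(d_model, n_experts)) for _ in range(n_iters)]

def coe_layer(x):
    """Chain-of-Experts sketch: tokens traverse the expert pool n_iters times,
    re-routed at each iteration, with a residual add per iteration."""
    for t in range(n_iters):
        logits = x @ routers[t]                         # (n_tokens, n_experts)
        chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # top-k experts per token
        gates = np.take_along_axis(logits, chosen, axis=-1)
        gates = np.exp(gates - gates.max(-1, keepdims=True))
        gates /= gates.sum(-1, keepdims=True)           # softmax over selected experts
        out = np.zeros_like(x)
        for i in range(x.shape[0]):                     # per-token dispatch (clarity over speed)
            for j in range(top_k):
                e = chosen[i, j]
                out[i] += gates[i, j] * np.maximum(x[i] @ experts[e], 0.0)
        x = x + out                                     # iterative residual structure
    return x

tokens = rng.normal(size=(3, d_model))
y = coe_layer(tokens)
print(y.shape)  # (3, 8)
```

Because each iteration uses its own router, a token's expert set can change between iterations, which is what yields the combinatorially richer expert compositions the paper credits for the improved loss.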