Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE

February 10, 2025
Authors: Haiduo Huang, Fuwei Yang, Zhenhua Liu, Yixing Xu, Jinze Li, Yang Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
cs.AI

Abstract

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, which leverages a Mixture of Experts (MoE) in which independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy that combines autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and we enhance the latter with a contrastive mechanism over features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new state of the art in speculative decoding. Our code is available at https://github.com/haiduo/Jakiro.
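
To make the draft-then-verify loop concrete, below is a minimal, self-contained Python sketch in which several independent draft heads stand in for Jakiro's decoupled MoE experts. Everything here is a hypothetical illustration: the toy functions (draft_step, target_step, speculative_generate) and the seeded-random "models" are stand-ins for a small draft LM and a large target LM, and the sketch covers only the multi-branch draft/verify step, not the paper's hybrid autoregressive-plus-parallel stage or its contrastive feature mechanism.

```python
# Minimal sketch of speculative decoding with multiple independent
# draft heads. Toy stand-ins only, not Jakiro's implementation.
import random

VOCAB = list(range(100))

def draft_step(context, expert_id):
    # Toy draft head: each "expert" is biased toward a different slice
    # of the vocabulary, mimicking decoupled heads whose candidates do
    # not all come from one shared representation.
    random.seed(hash((tuple(context), expert_id)) & 0xFFFF)
    return random.choice(VOCAB[expert_id * 10:(expert_id + 1) * 10])

def target_step(context):
    # Toy target model: the (deterministic, seeded) ground truth the
    # verifier checks drafted tokens against.
    random.seed(hash(tuple(context)) & 0xFFFF)
    return random.choice(VOCAB[:30])

def speculative_generate(prompt, num_experts=3, depth=4, max_len=32):
    tokens = list(prompt)
    while len(tokens) < max_len:
        # Each expert proposes an independent draft chain (a candidate
        # tree of width num_experts, flattened here into branches).
        branches = []
        for e in range(num_experts):
            branch, ctx = [], list(tokens)
            for _ in range(depth):
                t = draft_step(ctx, e)
                branch.append(t)
                ctx.append(t)
            branches.append(branch)
        # Greedy verification: keep the longest drafted prefix that
        # matches the target model's own choices.
        best = []
        for branch in branches:
            ctx, accepted = list(tokens), []
            for t in branch:
                if target_step(ctx) == t:
                    accepted.append(t)
                    ctx.append(t)
                else:
                    break
            if len(accepted) > len(best):
                best = accepted
        tokens += best
        # The target always contributes one token, so decoding makes
        # progress even when every draft branch is rejected.
        tokens.append(target_step(tokens))
    return tokens

print(speculative_generate([1, 2, 3]))
```

Because each branch originates from a separate "expert", same-depth candidates are no longer tied to a single shared representation; this is the decoupling the abstract argues improves candidate diversity and acceptance rates.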
