Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
February 10, 2025
Authors: Haiduo Huang, Fuwei Yang, Zhenhua Liu, Yixing Xu, Jinze Li, Yang Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
cs.AI
Abstract
Speculative decoding (SD) accelerates large language model inference by using
a smaller draft model to predict multiple tokens, which are then verified in
parallel by the larger target model. However, the limited capacity of the draft
model often necessitates tree-based sampling to improve prediction accuracy,
where multiple candidates are generated at each step. We identify a key
limitation in this approach: the candidates at the same step are derived from
the same representation, limiting diversity and reducing overall effectiveness.
To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where
independent experts generate diverse predictions, effectively decoupling
correlations among candidates. Furthermore, we introduce a hybrid inference
strategy, combining autoregressive decoding for initial tokens with parallel
decoding for subsequent stages, and enhance the latter with a contrastive
mechanism on features to improve accuracy. Our method significantly boosts
prediction accuracy and achieves higher inference speedups. Extensive
experiments across diverse models validate the effectiveness and robustness of
our approach, establishing a new SOTA in speculative decoding. Our code is
available at https://github.com/haiduo/Jakiro.
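To make the draft-and-verify scheme the abstract builds on concrete, here is a minimal sketch of one standard (greedy) speculative decoding step: a small draft model proposes several tokens autoregressively, and the large target model checks them all in a single parallel forward pass. The toy lookup "models", `draft_len`, and the greedy acceptance rule are illustrative assumptions, not Jakiro's actual implementation.

```python
# A minimal sketch of the standard draft-and-verify loop underlying
# speculative decoding (SD). Toy stand-ins, not Jakiro's code.
import torch

def speculative_step(target, draft, prefix, draft_len=4):
    """Propose `draft_len` tokens with the small draft model, then verify
    them in one parallel forward pass of the large target model (greedy)."""
    # 1) Draft: autoregressively generate candidate tokens with the cheap model.
    tokens = prefix.clone()
    for _ in range(draft_len):
        logits = draft(tokens)                       # (1, seq, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)

    # 2) Verify: one target forward pass over prefix + all drafted tokens.
    preds = target(tokens).argmax(-1)                # target's greedy choice per position

    # 3) Accept the longest drafted prefix the target agrees with.
    n_prefix = prefix.shape[1]
    accepted = n_prefix
    for i in range(n_prefix, tokens.shape[1]):
        if tokens[0, i] == preds[0, i - 1]:          # target predicts token i from position i-1
            accepted += 1
        else:
            break
    # Always gain at least one token: append the target's own next prediction.
    return torch.cat([tokens[:, :accepted], preds[:, accepted - 1:accepted]], dim=1)

# Toy usage with random lookup "models" (vocab of 100); real SD uses two LMs.
vocab = 100
torch.manual_seed(0)
table = torch.randn(vocab, vocab)
target = lambda x: table[x]                                        # (1, seq, vocab) logits
draft  = lambda x: table[x] + 0.1 * torch.randn(1, x.shape[1], vocab)  # noisy approximation
prefix = torch.randint(0, vocab, (1, 3))
print(speculative_step(target, draft, prefix).shape)
```

The payoff is that every accepted draft token replaces one full target forward pass; the limitation the abstract identifies arises when the per-step candidates for tree-based sampling all come from one shared representation.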
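The core architectural idea the abstract describes can be sketched as follows: rather than ranking the top-k tokens of a single head's softmax (all tied to one representation), route the draft feature through k independent experts so each tree-branch candidate gets its own projection. The `DecoupledDraftHeads` class, expert sizes, and layer choices below are hypothetical illustrations of that decoupling, not the paper's released architecture.

```python
# A minimal sketch of MoE-style decoupled draft heads: one candidate per
# independent expert instead of one distribution's top-k. Illustrative only.
import torch
import torch.nn as nn

class DecoupledDraftHeads(nn.Module):
    def __init__(self, hidden, vocab, num_experts=4):
        super().__init__()
        # Each expert is an independent projection from feature to vocabulary.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, vocab))
            for _ in range(num_experts)
        )

    def forward(self, h):
        # h: (batch, hidden) draft feature at the current tree step.
        # Each expert emits its own candidate token, decoupling the
        # candidates instead of ranking one shared softmax's top-k.
        return torch.stack([e(h).argmax(-1) for e in self.experts], dim=-1)

heads = DecoupledDraftHeads(hidden=64, vocab=100)
h = torch.randn(2, 64)
print(heads(h))   # (2, num_experts): one decoupled candidate per expert
```

Because the experts share no output projection, disagreement among them directly translates into more diverse tree branches, which is the effect the abstract credits for the higher acceptance rates and speedups.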