Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE

February 10, 2025
Authors: Haiduo Huang, Fuwei Yang, Zhenhua Liu, Yixing Xu, Jinze Li, Yang Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
cs.AI

Abstract

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the latter with a contrastive mechanism in the feature space to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our code is available at https://github.com/haiduo/Jakiro.
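To make the draft-and-verify loop concrete, here is a minimal sketch of generic speculative decoding in Python, using greedy-match acceptance rather than full speculative sampling for brevity. The callables `draft_next` and `target_next` are hypothetical stand-ins for the draft and target models, not the Jakiro API; the paper's actual implementation lives in the linked repository.

```python
# Minimal, illustrative sketch of the generic speculative decoding loop
# (greedy variant). `draft_next` and `target_next` are hypothetical
# stand-ins for the draft and target models, not the Jakiro API.

from typing import Callable, List

def speculative_decode(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model: next-token guess
    target_next: Callable[[List[int]], int],  # expensive target model: next token
    gamma: int = 4,                           # draft tokens proposed per round
    max_new_tokens: int = 32,
) -> List[int]:
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        # 1) Draft phase: the small model proposes `gamma` tokens autoregressively.
        proposal = []
        ctx = list(tokens)
        for _ in range(gamma):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2) Verify phase: the target model checks each proposed position.
        #    (A real implementation scores all positions in a single parallel
        #    forward pass; this loop is sequential only for clarity.)
        accepted = 0
        for i, t in enumerate(proposal):
            if target_next(tokens + proposal[:i]) == t:
                accepted += 1
            else:
                break

        # 3) Keep the accepted prefix, then take one token from the target
        #    model so every round makes progress even if nothing is accepted.
        tokens.extend(proposal[:accepted])
        tokens.append(target_next(tokens))
    return tokens[: len(prefix) + max_new_tokens]
```

Jakiro's contribution targets the draft phase of this loop: rather than deriving all same-step candidates from one shared representation, independent MoE experts each propose a candidate, decoupling the candidates and raising the expected number of tokens accepted per verification round.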
