BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Abstract

Speculative decoding has emerged as a popular method to accelerate the inference of Large Language Models (LLMs) while retaining their superior text generation performance. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models in an offline or online manner to align them with the context. This paper proposes a training-free online learning framework to adaptively choose the configuration of the hyperparameters for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework, BanditSpec. Furthermore, two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, are designed and analyzed in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, we show that the regret performance of UCBSpec is optimal up to universal constants. Finally, extensive empirical experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and that their throughput is close to that of the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
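To make the bandit formulation concrete, here is a minimal sketch of the textbook UCB1 rule choosing among a handful of candidate configurations. It is not the paper's UCBSpec algorithm, and it does not model the stopping time regret. The names UCB1 and simulate, the Bernoulli reward, and the accept_rates values are all hypothetical stand-ins: each arm represents one speculative decoding configuration, and the reward is an assumed proxy for whether a drafted token was accepted.

```python
import math
import random


class UCB1:
    """Textbook UCB1 over a finite set of arms (illustrative, not UCBSpec)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # number of times each arm was played
        self.means = [0.0] * n_arms  # running mean reward of each arm
        self.t = 0                   # total number of updates so far

    def select(self):
        # Play every arm once before relying on confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # Empirical mean plus an exploration bonus that shrinks with plays.
        scores = [
            mean + math.sqrt(2.0 * math.log(self.t) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update; rewards are assumed to lie in [0, 1].
        self.t += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(accept_rates, rounds=2000, seed=0):
    """Run UCB1 against hypothetical per-configuration acceptance rates."""
    rng = random.Random(seed)
    bandit = UCB1(len(accept_rates))
    for _ in range(rounds):
        arm = bandit.select()
        # Assumed reward model: 1 if the drafted token is accepted, else 0.
        reward = 1.0 if rng.random() < accept_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.counts


if __name__ == "__main__":
    # Three hypothetical configurations with made-up acceptance rates.
    print(simulate([0.2, 0.5, 0.8]))
```

In this toy setting the selector concentrates its plays on the configuration with the highest assumed acceptance rate while still sampling the others occasionally. In a real serving stack the reward would come from the verifier's accepted-token count rather than from a simulated coin flip.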
