BanditSpec: Adaptive Speculative Decoding via Bandit Algorithms

Abstract

Speculative decoding has emerged as a popular method for accelerating the inference of Large Language Models (LLMs) while retaining their text generation quality. Previous methods either adopt a fixed speculative decoding configuration regardless of the prefix tokens, or train draft models offline or online to align them with the context. This paper proposes a training-free online learning framework that adaptively chooses the hyperparameter configuration for speculative decoding as text is being generated. We first formulate this hyperparameter selection problem as a Multi-Armed Bandit problem and provide a general speculative decoding framework, BanditSpec. We then design two bandit-based hyperparameter selection algorithms, UCBSpec and EXP3Spec, and analyze them in terms of a novel quantity, the stopping time regret. We upper bound this regret under both stochastic and adversarial reward settings. By deriving an information-theoretic impossibility result, we show that the regret of UCBSpec is optimal up to universal constants. Finally, extensive experiments with LLaMA3 and Qwen2 demonstrate that our algorithms are effective compared to existing methods, and that their throughput is close to that of the oracle best hyperparameter in simulated real-life LLM serving scenarios with diverse input prompts.
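To make the bandit formulation concrete, here is a minimal sketch that treats each candidate speculative decoding configuration as an arm and picks one per decoding round with the textbook UCB1 rule. It is not the paper's UCBSpec algorithm, whose details the abstract does not give. The names `ucb1_select` and `simulate_generation`, the per-configuration acceptance probabilities, the draft length `gamma`, and the reward (fraction of drafted tokens accepted) are all assumptions made for illustration, and the acceptance process is simulated rather than produced by a real model.

```python
import math
import random


def ucb1_select(counts, means, t, c=2.0):
    """Return the arm with the highest UCB1 score; unplayed arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def simulate_generation(accept_probs, gamma=4, target_len=2000, seed=0):
    """Simulate one generation, choosing a configuration every round.

    accept_probs: hypothetical probability, per configuration, that each
    drafted token is accepted (an assumption, not measured data).
    Returns (pull counts per configuration, number of rounds used).
    """
    rng = random.Random(seed)
    k = len(accept_probs)
    counts = [0] * k      # times each configuration was chosen
    means = [0.0] * k     # running mean reward per configuration
    generated = 0
    rounds = 0
    # The loop ends once enough tokens exist, so the number of rounds
    # is not fixed in advance; it depends on how many drafts are accepted.
    while generated < target_len:
        rounds += 1
        arm = ucb1_select(counts, means, rounds)
        # Drafted tokens are accepted in order until the first rejection.
        accepted = 0
        for _ in range(gamma):
            if rng.random() < accept_probs[arm]:
                accepted += 1
            else:
                break
        # Assume the target model contributes one extra token per round.
        generated += accepted + 1
        reward = accepted / gamma  # normalized to [0, 1]
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, rounds


if __name__ == "__main__":
    counts, rounds = simulate_generation([0.3, 0.5, 0.9])
    print("pulls per configuration:", counts, "rounds:", rounds)
```

In this toy setting the selector concentrates its pulls on the configuration with the highest simulated acceptance rate, which mirrors the goal of approaching the throughput of the best fixed hyperparameter. An adversarial-reward variant would replace `ucb1_select` with an exponential-weights rule in the style of EXP3.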
