Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

Abstract

We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous work has shown that the optimal scheme, which maximizes the probability of accepting one of the input tokens, can be cast as the solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step, an importance sampling (IS) type scheme is used to select one intermediate token; in the second step, (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models, we further 1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and 2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.
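
The two-step structure described above can be illustrated with a minimal Python sketch for two i.i.d. drafts over a toy vocabulary. Everything here is an assumption made for illustration: the function names are hypothetical, and the step-1 rule (pick a draft with probability proportional to its importance weight p/q) is only one simple IS-type choice, not the optimal scheme derived in the paper. Step 2 is standard single-draft speculative sampling applied to the exact distribution r of the intermediate token, which is what keeps the output distributed according to the target p.

```python
import random

def pick_first_prob(wa, wb):
    # Probability of keeping the first draft; ties at zero weight split evenly.
    total = wa + wb
    return wa / total if total > 0 else 0.5

def intermediate_dist(p, q):
    """Exact distribution r of the step-1 token for two i.i.d. drafts from q."""
    n = len(p)
    # Importance weights p/q (illustrative choice); q == 0 tokens are never drafted.
    w = [p[i] / q[i] if q[i] > 0 else 0.0 for i in range(n)]
    r = [0.0] * n
    for a in range(n):
        for b in range(n):
            mass = q[a] * q[b]
            if mass == 0.0:
                continue
            keep_a = pick_first_prob(w[a], w[b])
            r[a] += mass * keep_a
            r[b] += mass * (1.0 - keep_a)
    return r

def step2_accept_prob(p, r):
    """Probability that step 2 keeps the intermediate token."""
    return sum(min(pi, ri) for pi, ri in zip(p, r))

def output_dist(p, r):
    """Exact distribution of the final token after step 2."""
    residual = [max(pi - ri, 0.0) for pi, ri in zip(p, r)]
    z = sum(residual)
    reject_mass = 1.0 - step2_accept_prob(p, r)
    return [
        min(pi, ri) + (reject_mass * res / z if z > 0 else 0.0)
        for pi, ri, res in zip(p, r, residual)
    ]

def two_step_sample(p, q, rng):
    """Draw one output token; returns (token, kept_intermediate)."""
    n = len(p)
    a, b = rng.choices(range(n), weights=q, k=2)  # two i.i.d. drafts
    # Step 1: IS-type selection of an intermediate token.
    keep_a = pick_first_prob(p[a] / q[a], p[b] / q[b])
    x = a if rng.random() < keep_a else b
    # Step 2: single-draft speculative sampling against r.
    r = intermediate_dist(p, q)
    if rng.random() < min(1.0, p[x] / r[x]):
        return x, True
    residual = [max(p[i] - r[i], 0.0) for i in range(n)]
    return rng.choices(range(n), weights=residual)[0], False
```

For instance, with `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.3, 0.5]`, calling `output_dist(p, intermediate_dist(p, q))` recovers `p` up to floating-point error, which is the distribution-matching property the abstract requires of any valid selection scheme.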