Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

Abstract

We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Prior work has shown that the optimal scheme (the one that maximizes the probability of accepting one of the input tokens) can be cast as the solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step, an importance sampling (IS) type scheme selects one intermediate token; in the second step, (single-draft) speculative sampling is applied to that token to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and 2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motivates a new class of token-level selection schemes based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.
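
The two-step structure described in the abstract can be illustrated with a toy sketch. This is not the optimal scheme from the paper: the weights q(x)/p(x), the function names, and the exhaustive enumeration used to obtain the distribution r of the selected token are all assumptions made for illustration, and the enumeration is only feasible for a very small vocabulary. Step 1 picks one of k i.i.d. draft tokens with probability proportional to its weight; step 2 applies standard single-draft speculative sampling with r as the proposal and q as the target, so the output token is distributed according to q.

```python
import itertools
import random


def selection_distribution(p, weights, k):
    """Exact distribution r of the token picked in step 1 (toy vocabularies only).

    k drafts are drawn i.i.d. from p, and one draft is picked with probability
    proportional to weights[token]. Weights are assumed to be positive.
    """
    n = len(p)
    r = [0.0] * n
    for drafts in itertools.product(range(n), repeat=k):
        prob = 1.0
        for t in drafts:
            prob *= p[t]
        if prob == 0.0:
            continue
        total = sum(weights[t] for t in drafts)
        for t in drafts:
            r[t] += prob * weights[t] / total
    return r


def select_token(drafts, weights, rng):
    """Step 1 (assumed IS-type rule): pick one draft token, weighted."""
    return rng.choices(drafts, weights=[weights[t] for t in drafts], k=1)[0]


def speculative_step(token, r, q, rng):
    """Step 2: single-draft speculative sampling with proposal r and target q."""
    if rng.random() < min(1.0, q[token] / r[token]):
        return token, True
    # A rejection implies q[token] < r[token], so the residual has positive mass.
    residual = [max(qi - ri, 0.0) for qi, ri in zip(q, r)]
    return rng.choices(range(len(q)), weights=residual, k=1)[0], False


def acceptance_probability(r, q):
    """Acceptance probability of single-draft speculative sampling."""
    return sum(min(ri, qi) for ri, qi in zip(r, q))


def two_step_sample(p, q, weights, r, k, rng):
    """Draw k drafts from p, select one, then accept it or resample."""
    drafts = rng.choices(range(len(p)), weights=p, k=k)
    candidate = select_token(drafts, weights, rng)
    token, accepted = speculative_step(candidate, r, q, rng)
    return drafts, token, accepted
```

To try it, compute r once with selection_distribution(p, weights, k) and pass it to two_step_sample together with a seeded random.Random instance. Comparing acceptance_probability(r, q) against acceptance_probability(p, q) shows how much the selection step helps relative to a single draft for a given choice of weights.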