Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Abstract

Speculative sampling (SpS) has been successful in increasing the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly requires the generated distribution to match that of the verifier LLM. This is unnecessarily restrictive, since slight variations of the verifier's distribution, such as sampling with top-k or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution while increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
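
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard SpS accept/reject step that the abstract describes as matching the verifier distribution exactly. It is not the Cactus method, whose acceptance rule is not given in this abstract. The toy three-token vocabulary, the function names, and the probability lists `p` (verifier) and `q` (draft) are illustrative assumptions.

```python
import random


def residual_distribution(p, q):
    """Normalize max(p - q, 0): the distribution resampled from after a rejection."""
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [r / total for r in residual]


def expected_acceptance_rate(p, q):
    """Probability that a drafted token is accepted: sum over tokens of min(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def speculative_step(p, q, rng):
    """One draft-then-verify step over a toy vocabulary (hypothetical helper).

    p: verifier probabilities, q: draft probabilities (same length, each sums to 1).
    Returns (token, accepted).
    """
    vocab = list(range(len(p)))
    # The draft model proposes a token x ~ q.
    x = rng.choices(vocab, weights=q, k=1)[0]
    # The verifier accepts x with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(p - q, 0).
    y = rng.choices(vocab, weights=residual_distribution(p, q), k=1)[0]
    return y, False


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # assumed verifier distribution
    q = [0.2, 0.3, 0.5]  # assumed draft distribution
    rng = random.Random(0)
    print(speculative_step(p, q, rng))
    print(expected_acceptance_rate(p, q))
```

Under this rule the emitted token is distributed exactly according to `p`, and the acceptance rate equals the overlap between `p` and `q` (0.7 for the assumed distributions above). This is the sense in which exact matching limits how many draft tokens can be accepted, and it is the restriction that TAS and Cactus relax.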
