Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Abstract

Speculative sampling (SpS) accelerates the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces that the generated distribution match that of the verifier LLM. This is unnecessarily restrictive, since slight variations of the verifier's distribution, such as top-k or temperature sampling, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution while increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
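
For context, the baseline SpS accept/reject rule that the abstract describes as strictly matching the verifier distribution can be sketched as below. This is only the standard rule from the speculative sampling literature, not the Cactus acceptance rule, which the abstract does not specify. The function names (`speculative_step`, `estimate`) and the toy distributions are illustrative assumptions.

```python
import random


def speculative_step(p, q, rng):
    """One accept/reject step of standard speculative sampling.

    p: verifier distribution over the vocabulary (list of probabilities)
    q: draft distribution over the same vocabulary
    Returns (token, accepted).
    """
    # The draft model proposes a token x ~ q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept the proposal with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the renormalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def estimate(p, q, n=20000, seed=0):
    """Empirical output frequencies and acceptance rate over n steps."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        token, ok = speculative_step(p, q, rng)
        counts[token] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # toy verifier distribution (assumed)
    q = [0.2, 0.5, 0.3]  # toy draft distribution (assumed)
    freqs, rate = estimate(p, q)
    print("output frequencies:", freqs)
    print("acceptance rate:", rate)
```

Under this rule the output frequencies track the verifier distribution `p`, and the expected acceptance rate is the sum over tokens of `min(p, q)`. That exactness is the constraint the abstract proposes to relax in a controlled way in exchange for higher acceptance rates.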