BASS: Batched Attention-optimized Speculative Sampling

Abstract

Speculative decoding has emerged as a powerful method for improving latency and throughput when hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a batched speculative decoding system that sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, with an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest utilization of regular decoding and around 10X that of single-sequence speculative decoding.
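To illustrate why the batched setting is non-trivial, the following is a minimal sketch of the standard speculative sampling acceptance rule applied independently to each sequence in a batch. It is not the system's implementation: the function names `accept_draft_tokens` and `batched_accept` and the toy probabilities are assumptions made for illustration, and the resampling step that normally follows a rejection is omitted. The point it shows is that each sequence may accept a different number of draft tokens, so the batch becomes ragged after every verification step.

```python
import random


def accept_draft_tokens(draft_probs, target_probs, rng):
    """Count how many leading draft tokens are accepted for one sequence.

    draft_probs[i]  : probability the draft model assigned to its i-th proposed token
    target_probs[i] : probability the main model assigns to that same token
    Token i is accepted with probability min(1, target / draft). The first
    rejection discards that token and every draft token after it.
    (Hypothetical helper; resampling after a rejection is left out.)
    """
    accepted = 0
    for q, p in zip(draft_probs, target_probs):
        if rng.random() < min(1.0, p / q):
            accepted += 1
        else:
            break
    return accepted


def batched_accept(batch_draft_probs, batch_target_probs, seed=0):
    """Apply the acceptance rule to every sequence in the batch.

    Returns one accepted-token count per sequence; the counts generally differ.
    """
    rng = random.Random(seed)
    return [
        accept_draft_tokens(q, p, rng)
        for q, p in zip(batch_draft_probs, batch_target_probs)
    ]


# Toy batch of 3 sequences, each with 3 draft tokens (made-up probabilities).
draft = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
target = [[0.9, 0.9, 0.9], [0.9, 0.0, 0.9], [0.0, 0.9, 0.9]]
print(batched_accept(draft, target))  # [3, 1, 0]
```

In this toy batch the three sequences advance by 3, 1, and 0 draft tokens, so they no longer share a common length. A batched system has to handle that raggedness in its attention computation and in its choice of draft length.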
