BASS: Batched Attention-optimized Speculative Sampling
April 24, 2024
Authors: Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras
cs.AI
Abstract
Speculative decoding has emerged as a powerful method to improve latency and
throughput in hosting large language models. However, most existing
implementations focus on generating a single sequence. Real-world generative AI
applications often require multiple responses, and performing speculative
decoding in a batched setting while preserving its latency benefits poses
non-trivial challenges. This paper describes a system of batched speculative
decoding that sets a new state of the art in multi-sequence generation latency
and that demonstrates superior GPU utilization as well as quality of
generations within a time budget. For example, for a 7.8B-size model on a
single A100 GPU and with a batch size of 8, each sequence is generated at an
average speed of 5.8ms per token, the overall throughput being 1.1K tokens per
second. These results represent state-of-the-art latency and a 2.15X speed-up
over optimized regular decoding. Within a time budget that regular decoding
does not finish, our system is able to generate sequences with HumanEval
Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with
single-sequence speculative decoding. Our peak GPU utilization during decoding
reaches as high as 15.8%, more than 3X the highest achieved by regular decoding
and around 10X that of single-sequence speculative decoding.
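
For readers unfamiliar with the technique the abstract builds on, the following is a minimal sketch of the standard single-sequence speculative sampling acceptance rule: a draft token is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized residual max(p - q, 0). It is not the batched BASS system described above. The function name `speculative_step` and the list-based probability inputs are illustrative assumptions, not taken from the paper.

```python
import random


def speculative_step(p_target, q_draft, draft_tokens, rng):
    """Verify a run of draft tokens against the target model (hypothetical helper).

    p_target[i] and q_draft[i] are probability lists over the vocabulary at
    draft position i. Returns (accepted_tokens, correction_token), where
    correction_token is None if every draft token was accepted.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept the draft token with probability min(1, p[tok] / q[tok]).
        # q[tok] > 0 is assumed because the draft model sampled tok from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual max(p - q, 0); the weights
        # need not be normalized for random.choices.
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        correction = rng.choices(range(len(p)), weights=residual, k=1)[0]
        # Tokens after the first rejection are discarded.
        return accepted, correction
    return accepted, None
```

In a batched setting, each sequence in the batch may accept a different number of draft tokens in the same step, which is one source of the challenges the abstract mentions.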