BASS: Batched Attention-optimized Speculative Sampling

April 24, 2024
Authors: Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras
cs.AI

Abstract

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, with an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest achieved by regular decoding and around 10X that of single-sequence speculative decoding.
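
For readers unfamiliar with the underlying technique, the following is a minimal sketch of the standard single-position accept/reject rule used in speculative sampling, written against toy probability lists. It is not the BASS system and says nothing about its batching or attention kernels; the function name `speculative_accept` and its parameters are hypothetical, chosen only for illustration.

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng):
    """One accept/reject step of standard speculative sampling (illustrative only).

    draft_token: index proposed by the draft model, assumed sampled from q_draft.
    p_target, q_draft: probability lists over the same toy vocabulary.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]  # > 0 because the draft token was sampled from q_draft

    # Accept the draft token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True

    # On rejection, resample from the residual distribution max(0, p - q).
    # rng.choices normalizes the weights, so no explicit division is needed.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


if __name__ == "__main__":
    rng = random.Random(0)
    # Identical distributions: the draft token is always accepted.
    print(speculative_accept(0, [0.5, 0.5], [0.5, 0.5], rng))
    # Target puts zero mass on the draft token: always rejected and replaced.
    print(speculative_accept(0, [0.0, 1.0], [1.0, 0.0], rng))
```

In a full decoder this rule would be applied to each drafted token in turn, stopping at the first rejection. In a batched setting, different sequences may accept different numbers of draft tokens, which is one plausible source of the challenges the abstract alludes to.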
