BASS: Batched Attention-optimized Speculative Sampling

Abstract

Speculative decoding has emerged as a powerful method for improving latency and throughput when hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a batched speculative decoding system that sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-parameter model on a single A100 GPU with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, for an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system generates sequences with a HumanEval Pass@First of 43% and a Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest utilization of regular decoding and around 10X that of single-sequence speculative decoding.
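
For readers unfamiliar with the technique the abstract builds on, the following is a minimal sketch of the standard single-sequence accept/reject rule used in speculative sampling. It is not the BASS implementation: the function name `speculative_step`, its arguments, and the list-based probability tables are assumptions made for illustration only. In a batched setting, each sequence may accept a different number of draft tokens in the same step, which is one reason batching is harder than the single-sequence case shown here.

```python
import random


def speculative_step(draft_tokens, draft_probs, target_probs, rng=random):
    """One accept/reject pass of speculative sampling for a single sequence.

    draft_tokens: k token ids proposed by the draft model.
    draft_probs:  k distributions (lists over the vocabulary) from the draft model.
    target_probs: k + 1 distributions from the target model for the same positions.
    Returns the tokens emitted in this step (between 1 and k + 1 of them).
    All names here are hypothetical and chosen for illustration.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # Accept the drafted token with probability min(1, p[tok] / q[tok]).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft token was accepted: draw one bonus token from the target model.
    last = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(last)), weights=last)[0])
    return out
```

When the draft and target distributions agree, every draft token is accepted and the step emits k + 1 tokens from a single target-model pass; when they disagree strongly, the step falls back to emitting a single corrected token.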
