BASS: Batched Attention-optimized Speculative Sampling

Abstract

Speculative decoding has emerged as a powerful method for improving latency and throughput when hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a batched speculative decoding system that sets a new state of the art in multi-sequence generation latency and demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, with an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest achieved by regular decoding and around 10X that of single-sequence speculative decoding.
