Accelerating LLM Inference with Staged Speculative Decoding

Abstract

Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, these changes reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model while perfectly preserving output quality.
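
To make the underlying draft-and-verify idea concrete, the following is a minimal sketch of plain (single-stage, linear) speculative decoding under greedy decoding. It is not the paper's implementation: `target_next`, `draft_next`, `greedy_decode`, and `speculative_decode` are hypothetical names, both "models" are toy deterministic functions, and the verification loop checks draft tokens one at a time where a real system would score them in one batched forward pass of the large model. The tree-structured batch and the second draft stage described in the abstract are not modeled here.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large model (hypothetical): deterministic next token."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the small draft model (hypothetical).

    Agrees with the target except when the context length is a multiple of 4.
    """
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Reference decode: call the model once per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n: int, k: int = 4
) -> List[int]:
    """Generate n tokens; the draft proposes up to k tokens, the target verifies."""
    out = list(prompt)
    goal = len(prompt) + n
    while len(out) < goal:
        # 1) The draft model proposes a short continuation autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks each proposed position in order.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                # Keep the agreed prefix, then take the target's own token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            # Every proposed token matched the target.
            out.extend(proposal)
    return out
```

In this toy setting, `speculative_decode(target_next, draft_next, [1, 2, 3], 20)` returns the same sequence as `greedy_decode(target_next, [1, 2, 3], 20)`, because every emitted token is either confirmed or supplied by the target. That is the sense in which this sketch preserves output; any speedup in practice comes from the target verifying several positions per forward pass, which the toy loop does not measure.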
