MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Abstract

Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but serving long-context requests with both low latency and high throughput remains challenging. Speculative decoding (SD) is a widely used technique for reducing latency without sacrificing performance, but conventional wisdom holds that its efficacy is limited to small batch sizes. In MagicDec, we show that, surprisingly, SD can achieve speedup even in a high-throughput inference regime for moderate to long sequences. More interestingly, our rigorous analysis shows that an intelligent drafting strategy can achieve better speedup as batch size increases. MagicDec first identifies how the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy speculative decoding more effectively for high-throughput inference. It then leverages draft models with a sparse KV cache to address the KV bottleneck, which scales with both sequence length and batch size.
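To make the KV bottleneck concrete, the following is a rough back-of-the-envelope sketch in Python of how KV cache size compares with model weight size as batch size grows. The model configuration (`ModelConfig`), the 32,000-token context, and the fixed 512-token draft budget are hypothetical values chosen for illustration; none of them is taken from MagicDec, and a fixed per-sequence budget is only one possible way to realize a sparse KV cache.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical configuration for illustration only; not from the paper.
    n_layers: int = 32
    n_kv_heads: int = 8
    head_dim: int = 128
    n_params: int = 8_000_000_000
    bytes_per_value: int = 2  # fp16 / bf16 storage


def weight_bytes(cfg: ModelConfig) -> int:
    """Bytes needed to hold the model weights."""
    return cfg.n_params * cfg.bytes_per_value


def kv_cache_bytes(cfg: ModelConfig, batch_size: int, cached_tokens: int) -> int:
    """Bytes of KV cache for a batch; linear in batch size and cached tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * cfg.bytes_per_value
    return per_token * batch_size * cached_tokens


def kv_fraction(cfg: ModelConfig, batch_size: int, cached_tokens: int) -> float:
    """Share of (weights + KV cache) bytes that belongs to the KV cache."""
    kv = kv_cache_bytes(cfg, batch_size, cached_tokens)
    return kv / (kv + weight_bytes(cfg))


if __name__ == "__main__":
    cfg = ModelConfig()
    seq_len = 32_000      # assumed context length
    draft_budget = 512    # assumed fixed KV budget for a sparse-KV draft
    for batch_size in (1, 8, 64):
        full = kv_fraction(cfg, batch_size, seq_len)
        sparse = kv_fraction(cfg, batch_size, min(seq_len, draft_budget))
        print(f"batch={batch_size:3d}  full KV share={full:.2f}  sparse KV share={sparse:.2f}")
```

Under these assumptions, the full KV cache share rises with batch size while the weight bytes stay constant, which is the bottleneck shift the abstract refers to. A draft whose cache is capped at a fixed budget keeps its KV footprint independent of sequence length.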

