SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

Abstract

LLM inference for popular enterprise use cases, such as summarization, RAG, and code generation, typically sees prompt lengths that are orders of magnitude longer than generation lengths. This characteristic leads to a high prefill cost and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving the quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers' KV cache using a much earlier layer's output, allowing prompt tokens to skip much of the model computation; ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch sizes for higher throughput; and iii) a knowledge-preserving distillation procedure that adapts existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirements. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimal quality degradation across a wide range of tasks. In end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve 560 TFLOPS/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 GPUs.
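The headline figures can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, the Llama-3.1 shapes (32 or 80 layers, 8 KV heads, head dimension 128) are assumed from the public model configurations, and "normalized throughput" is assumed to count roughly 2 FLOPs per parameter per token for a dense forward pass.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Baseline KV-cache size for one token (16-bit values by default)."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def normalized_tokens_per_second(tflops_per_gpu, num_gpus, num_params):
    """Tokens/s implied by a FLOP rate, assuming ~2 FLOPs per parameter per token."""
    total_flops_per_second = tflops_per_gpu * 1e12 * num_gpus
    return total_flops_per_second / (2 * num_params)


# Assumed model shapes (from public configs, not stated in the abstract).
baseline_8b = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
baseline_70b = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Apply the 62.5% KV-cache reduction quoted in the abstract.
KV_REDUCTION = 0.625
reduced_8b = baseline_8b * (1 - KV_REDUCTION)
reduced_70b = baseline_70b * (1 - KV_REDUCTION)

# 560 TFLOPS/GPU on 4 GPUs for a model with about 70e9 parameters.
tokens_per_s = normalized_tokens_per_second(560, 4, 70e9)

print(f"8B  KV cache per token: {baseline_8b} B -> {reduced_8b:.0f} B")
print(f"70B KV cache per token: {baseline_70b} B -> {reduced_70b:.0f} B")
print(f"Implied throughput: {tokens_per_s:.0f} tokens/s")
```

Under these assumptions, 560 TFLOPS/GPU across four GPUs works out to 16,000 tokens/s, consistent with the 16K tokens/s quoted above.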
