Prompt Cache: Modular Attention Reuse for Low-Latency Inference

Abstract

We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompts. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
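The reuse idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ModuleCache`, `toy_encode`, and the whitespace tokenizer are hypothetical names introduced here, and the "attention state" is stood in for by a list of (position, token) pairs. The sketch assumes that each prompt module is assigned a fixed position range when it is registered, so that its precomputed state stays positionally valid wherever it is reused, and that uncached text is placed after all reserved ranges.

```python
from typing import Dict, List, Tuple

# Stand-in for per-token attention (key/value) states: (position id, token).
State = List[Tuple[int, str]]


def toy_encode(tokens: List[str], start: int) -> State:
    """Hypothetical placeholder for the expensive model forward pass."""
    return [(start + i, tok) for i, tok in enumerate(tokens)]


class ModuleCache:
    """Toy cache of precomputed states, keyed by prompt-module name."""

    def __init__(self) -> None:
        self.states: Dict[str, State] = {}
        self.next_pos = 0        # next free position id in the schema
        self.encode_calls = 0    # counts how often the "model" is run

    def _run(self, tokens: List[str], start: int) -> State:
        self.encode_calls += 1
        return toy_encode(tokens, start)

    def register(self, name: str, text: str) -> None:
        """Precompute and store a module's state at a reserved position range."""
        tokens = text.split()    # assumption: whitespace tokenization
        self.states[name] = self._run(tokens, self.next_pos)
        self.next_pos += len(tokens)

    def build(self, module_names: List[str], new_text: str) -> State:
        """Assemble a prompt state: reuse cached modules, encode only new text."""
        state: State = []
        for name in module_names:
            state.extend(self.states[name])   # reused, not recomputed
        # Assumption: uncached text is positioned after all reserved ranges.
        state.extend(self._run(new_text.split(), self.next_pos))
        return state


cache = ModuleCache()
cache.register("system", "You are a helpful assistant")
cache.register("doc", "Paris is the capital of France")
prompt_state = cache.build(["doc"], "What is the capital")
```

In this sketch, building the prompt runs the placeholder encoder only on the four new tokens; the document module keeps the positions it was given at registration time, which is the property the schema is described as guaranteeing.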
