SubGen: Token Generation in Sublinear Time and Memory

Abstract

Despite the significant success of large language models (LLMs), their extensive memory requirements make them challenging to deploy for long-context token generation. The substantial memory footprint of LLM decoders arises from the need to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, we focus on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building on this insight, we devise a novel caching method with sublinear complexity that employs online clustering on key tokens and online ℓ₂ sampling on values. The result is a provably accurate and efficient attention decoding algorithm, termed SubGen. The algorithm guarantees a sublinear memory footprint and sublinear time complexity, and we establish a tight error bound for the approach. Empirical evaluations on long-context question-answering tasks demonstrate that SubGen significantly outperforms existing and state-of-the-art KV cache compression methods in terms of performance and efficiency.
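
To make the clustering idea concrete, the following is a minimal sketch and not the SubGen algorithm itself. It assumes a simple greedy rule (a key joins the first stored center within a fixed radius, otherwise it opens a new cluster) and approximates the softmax normalizer from the centers and their counts. The names `OnlineKeyClusters`, `radius`, `approx_normalizer`, and `exact_normalizer` are hypothetical, and the ℓ₂ sampling of values is omitted.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class OnlineKeyClusters:
    """Hypothetical greedy online clustering of key vectors.

    Each incoming key joins the first stored center within `radius`;
    otherwise it becomes a new center. Only centers and counts are kept,
    so memory grows with the number of clusters, not the number of keys.
    """

    def __init__(self, radius):
        self.radius = radius
        self.centers = []
        self.counts = []

    def add(self, key):
        for i, center in enumerate(self.centers):
            if dist(key, center) <= self.radius:
                self.counts[i] += 1
                return
        self.centers.append(list(key))
        self.counts.append(1)

    def approx_normalizer(self, query):
        # Treat every key in a cluster as if it were the cluster's center.
        return sum(
            n * math.exp(dot(query, c))
            for c, n in zip(self.centers, self.counts)
        )


def exact_normalizer(query, keys):
    """Softmax normalizer computed over all keys (linear memory)."""
    return sum(math.exp(dot(query, k)) for k in keys)


# Toy stream: five keys that fall into two tight groups.
keys = [(1.0, 0.0), (1.01, 0.0), (0.99, 0.01), (0.0, 1.0), (0.0, 1.02)]
query = (0.5, 0.5)

cache = OnlineKeyClusters(radius=0.1)
for k in keys:
    cache.add(k)

approx = cache.approx_normalizer(query)
exact = exact_normalizer(query, keys)
rel_err = abs(approx - exact) / exact
print(len(cache.centers), cache.counts, rel_err)
```

In this toy stream the five keys collapse into two stored centers, and the approximation error depends on how tightly the keys cluster relative to `radius`. That dependence is the reason the clustering tendency of key embeddings matters for a compressed cache.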
