TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

Abstract

With large language models (LLMs) now widely deployed for long content generation, demand for efficient long-sequence inference support has grown. However, the key-value (KV) cache, which is stored to avoid re-computation, grows linearly with sequence length and has emerged as a critical bottleneck. Because LLMs are auto-regressive, the entire KV cache is loaded for every generated token, resulting in low utilization of computational cores and high latency. Various KV cache compression methods have been proposed to alleviate this issue, but they degrade generation quality. We introduce TriForce, a hierarchical speculative decoding system that scales to long sequence generation. This approach uses the original model weights together with a dynamic sparse KV cache obtained via retrieval as a draft model. That draft model serves as an intermediate layer in the hierarchy and is itself speculated by a smaller model to reduce its drafting latency. TriForce not only delivers speedups of up to 2.31× for Llama2-7B-128K on an A100 GPU but also scales to even longer contexts. In the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108 s/token, only half as slow as the auto-regressive baseline on an A100, which corresponds to a 7.78× speedup on our optimized offloading system. Additionally, TriForce is 4.86× faster than DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is highlighted by its consistently outstanding performance across various temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.
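
To make the linear growth of the KV cache concrete, the following is a rough back-of-the-envelope sketch, not taken from the paper. It assumes the published Llama2-7B configuration (32 layers, 32 attention heads, head dimension 128), 16-bit cache entries, and a batch size of 1. The function name `kv_cache_bytes` is hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    """Estimate the KV cache size in bytes for a single sequence.

    Defaults are assumptions matching the public Llama2-7B config with
    16-bit cache entries; they are not values reported in the abstract.
    """
    # Factor 2: one key vector and one value vector per layer, head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


per_token = kv_cache_bytes(1)
full_context = kv_cache_bytes(128 * 1024)

print(f"{per_token / 2**20} MiB per token")
print(f"{full_context / 2**30} GiB at 128K tokens")
```

Under these assumptions the cache costs 0.5 MiB per token and 64 GiB at a 128K-token context, and in plain auto-regressive decoding all of it is read for every generated token, which is the bottleneck the abstract describes.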
