LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Abstract

The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the key-value cache grows rapidly during decoding and imposes heavy memory and latency costs. Recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, but such coarse-grained sharing undermines model performance because it neglects the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, a novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads, which dynamically identify crucial tokens, and a majority of sparse heads, which reuse those tokens for efficient computation. Through extensive experiments on leading models such as Llama3 and Qwen3, across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, that of the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
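The split between retrieval heads and sparse heads can be illustrated with a minimal NumPy sketch of a single decoding step for one layer. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the head assignment is supplied as a plain boolean mask (`is_retrieval`) rather than learned with HardKuma; each head slot keeps its own token set that a later sparse head in the same slot reuses; and the top-k step uses `np.argpartition` instead of a hardware-efficient kernel. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def hybrid_decode_step(q, K, V, is_retrieval, token_sets, k):
    """One decoding step for one layer (illustrative sketch only).

    q:            (H, d) query of the current token, one row per head
    K, V:         (H, T, d) cached keys and values
    is_retrieval: (H,) boolean mask; True marks a retrieval head (assumed given)
    token_sets:   list of length H holding None or an index array per head slot;
                  updated in place so a later layer can reuse the sets
    k:            number of cached tokens a sparse head attends to
    """
    H, T, d = K.shape
    k = min(k, T)
    scale = 1.0 / np.sqrt(d)
    out = np.empty((H, d))
    for h in range(H):
        if is_retrieval[h] or token_sets[h] is None:
            # Retrieval head (or no set available yet): score every cached
            # token, attend densely, and refresh the top-k set for this slot.
            scores = (K[h] @ q[h]) * scale
            token_sets[h] = np.argpartition(scores, T - k)[T - k:]
            out[h] = softmax(scores) @ V[h]
        else:
            # Sparse head: reuse the stored indices and touch only k tokens.
            idx = token_sets[h]
            scores = (K[h, idx] @ q[h]) * scale
            out[h] = softmax(scores) @ V[h, idx]
    return out
```

In this sketch a sparse head reads only k rows of the cache instead of all T, which is where the savings at long context lengths would come from. Calling the function layer by layer with the same `token_sets` list mimics the reuse of crucial tokens, under the propagation assumption stated above.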
