LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

February 4, 2026
Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
cs.AI

Abstract

The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the key-value cache grows rapidly during decoding and imposes heavy memory and latency costs. Recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, but such coarse-grained sharing neglects the functional diversity of attention heads and therefore degrades model performance. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, a novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads, which dynamically identify crucial tokens, and a majority of sparse heads, which reuse those tokens for efficient computation. Through extensive experiments on leading models such as Llama3 and Qwen3, across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times even surpassing, that of the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
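The division of labor between retrieval heads and sparse heads can be illustrated with a toy decode step. The sketch below is not the authors' implementation. It assumes, purely for illustration, that heads are processed in order, that the head roles are given as a fixed boolean list (the abstract says they are determined by a HardKuma-based mechanism, which is not modeled here), and that each sparse head reuses the top-k token indices chosen by the most recent retrieval head. All function and parameter names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values, token_ids=None):
    """Single-query attention over all cached tokens, or only over token_ids."""
    ids = list(range(len(keys))) if token_ids is None else list(token_ids)
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in ids])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, ids)) for d in range(dim)]


def top_k_tokens(q, keys, k):
    """Indices (ascending) of the k cached keys with the highest scores for q."""
    scores = [dot(q, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def hybrid_decode_step(queries, keys, values, is_retrieval, k):
    """One decode step over all heads (illustrative assumption, see lead-in).

    queries[h] is the query of head h; keys[h] and values[h] are its KV cache.
    is_retrieval[h] marks retrieval heads (hypothetical, fixed assignment).
    """
    shared = None  # token indices selected by the latest retrieval head
    outputs = []
    for h, q in enumerate(queries):
        if is_retrieval[h] or shared is None:
            # Retrieval head: full attention, then refresh the shared token set.
            # A sparse head with nothing to reuse yet falls back to this path.
            outputs.append(attend(q, keys[h], values[h]))
            shared = top_k_tokens(q, keys[h], k)
        else:
            # Sparse head: attend only to the k shared tokens.
            outputs.append(attend(q, keys[h], values[h], shared))
    return outputs, shared
```

In this sketch a sparse head touches only k cache entries per step instead of the full context, which is the kind of saving the abstract attributes to reusing the selected tokens. Setting k to the full cache length makes every sparse head reduce to full attention.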