LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Abstract

The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times even surpassing, that of the full-attention baseline. Crucially, this is accomplished with a speedup of up to 2.7x at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
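
The split between retrieval heads and sparse heads can be illustrated with a minimal single-step sketch in plain Python. It is not the paper's implementation: the function names, the fixed `is_retrieval` role flags (standing in for the learned HardKuma assignment), the plain dot-product top-k scoring, and the rule that each sparse head reuses the token set of the most recent retrieval head are all assumptions made for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(q, keys, values, idx):
    """Scaled dot-product attention of query q over the cached positions in idx."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in idx])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]


def top_k_indices(q, keys, k):
    """Positions of the k cached keys with the highest dot-product score against q."""
    scores = [dot(q, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def hybrid_head_step(queries, keys, values, is_retrieval, k):
    """One decoding step over all heads (hypothetical helper, not from the paper).

    queries[h] is the query of head h; keys[h] and values[h] are its KV cache.
    is_retrieval[h] is an assumed, fixed role flag; it is not learned here.
    """
    outputs, shared_idx = [], None
    for h, q in enumerate(queries):
        if is_retrieval[h] or shared_idx is None:
            # Retrieval head (or no token set selected yet): attend over the
            # full cache, then refresh the shared set of crucial tokens.
            all_idx = list(range(len(keys[h])))
            outputs.append(attend(q, keys[h], values[h], all_idx))
            shared_idx = top_k_indices(q, keys[h], k)
        else:
            # Sparse head: attend only to the k tokens chosen by the most
            # recent retrieval head, touching k cache entries instead of all.
            outputs.append(attend(q, keys[h], values[h], shared_idx))
    return outputs, shared_idx
```

In this sketch a sparse head reads only `k` cached entries per step, which is where the memory and latency saving would come from; setting `k` to the full cache length makes every head fall back to full attention.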
