SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

Abstract

Sparse attention reduces compute and memory bandwidth for long-context LLM inference. However, two key challenges remain: (1) KV cache capacity still grows with sequence length, and offloading to CPU memory introduces a PCIe transfer bottleneck; (2) the sparse selection step itself retains O(T²) complexity and can dominate attention cost at long contexts. We propose SparDA, a decoupled sparse attention architecture that introduces a fourth per-layer projection, the Forecast, alongside Query, Key, and Value. The Forecast predicts the KV blocks needed by the next layer, enabling lookahead selection that overlaps CPU-to-GPU prefetch with current-layer execution. Because Forecast is decoupled from the attention query, our GQA implementation uses one Forecast head per GQA group, reducing selection overhead versus the original multi-head selector. SparDA adds <0.5% parameters and trains only the Forecast projections by matching the original selector's attention distribution. On two sparse-pretrained 8B models, SparDA matches or slightly improves accuracy and delivers up to 1.25× prefill speedup and 1.7× decode speedup over the sparse-attention offload baseline. By enabling larger feasible batch sizes on a single GPU, SparDA further reaches up to 5.3× higher decode throughput than the non-offload sparse baseline. Our source code is available at https://github.com/NVlabs/SparDA.
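
To make the lookahead idea concrete, the following is a minimal sketch of Forecast-driven block selection for a single GQA group. It is not taken from the SparDA code base. The function names, the weight layout, the use of one summary key per KV block, the scaled dot-product scoring, and the KL form of the matching loss are all assumptions made for illustration; the abstract states only that the Forecast predicts the next layer's KV blocks and is trained to match the original selector's attention distribution.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forecast_vector(hidden, w_forecast):
    # Hypothetical layout: w_forecast has d_head rows, each of length d_model.
    # One such projection is assumed per GQA group (one Forecast head per group).
    return [dot(row, hidden) for row in w_forecast]

def block_distribution(forecast, block_keys):
    # Assumption: each next-layer KV block is represented by one summary key,
    # and blocks are scored by a scaled dot product with the Forecast vector.
    scale = math.sqrt(len(forecast))
    return softmax([dot(forecast, key) / scale for key in block_keys])

def select_blocks(probs, top_k):
    # Keep the top_k highest-probability blocks; return their indices in
    # ascending order so the prefetch can walk the blocks sequentially.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:top_k])

def kl_divergence(target, pred, eps=1e-12):
    # Assumed training signal: KL(target || pred), where target is the original
    # selector's distribution over blocks and pred is the Forecast's.
    return sum(t * (math.log(t + eps) - math.log(p + eps))
               for t, p in zip(target, pred))
```

In a decode step under these assumptions, `select_blocks` would be called once per GQA group while the current layer is still executing, and the returned indices would drive the CPU-to-GPU copy of the next layer's KV blocks so that the transfer overlaps with compute.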