Star Attention: Efficient LLM Inference over Long Sequences

Abstract

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed in parallel across hosts using blockwise-local attention. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
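The two-phase attention pattern described above can be illustrated with a small mask-building sketch. It covers only what the abstract states: context tokens attend within their own block, and query and response tokens attend to all earlier tokens. The function names, the causal (no look-ahead) masking, and the toy sizes are assumptions made for illustration; any further details of the published method, and the sharding across hosts, are not modeled here.

```python
def star_attention_mask(context_len, query_len, block_size):
    """Hypothetical helper: mask[i][j] is True if token i may attend to token j.

    Tokens 0..context_len-1 are context; the rest are query/response tokens.
    Causal masking is an assumption, as is usual for decoder-only LLMs.
    """
    total = context_len + query_len
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: a token never attends to later tokens
            if i < context_len:
                # Phase 1: a context token sees only tokens in its own block.
                mask[i][j] = (i // block_size) == (j // block_size)
            else:
                # Phase 2: query/response tokens see every earlier token.
                mask[i][j] = True
    return mask


def count_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    sparse = count_pairs(star_attention_mask(8, 2, 4))
    # With a single block covering the whole context, phase 1 is ordinary
    # causal attention, so this is the dense baseline.
    dense = count_pairs(star_attention_mask(8, 2, 8))
    print(sparse, dense)
```

In this toy setting the block-sparse pattern computes fewer attention pairs than the dense causal baseline, and the gap widens as the context grows relative to the block size, since phase 1 cost scales with the block size rather than the full context length.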
