Star Attention: Efficient LLM Inference over Long Sequences

Abstract

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
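The two phases can be illustrated with a small NumPy sketch. It is a toy built only from the description in the abstract, not the authors' implementation. It assumes a single attention head and a single layer, and it simulates "hosts" as entries of a Python list. The function names (`softmax_attention`, `phase1_local_context`, `phase2_global_query`) are hypothetical. Merging per-host results with a log-sum-exp weighting is an assumption about how global attention might be computed with little communication; the abstract does not specify it. Attention of query tokens to one another is omitted for brevity.

```python
import numpy as np

def softmax_attention(q, k, v, causal=False):
    """Single-head attention; returns (output, log-sum-exp of the scores)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Mask future positions; assumes q and k cover the same token positions.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    m = scores.max(axis=-1, keepdims=True)
    w = np.exp(scores - m)
    denom = w.sum(axis=-1, keepdims=True)
    out = (w / denom) @ v
    lse = (m + np.log(denom)).squeeze(-1)
    return out, lse

def phase1_local_context(q_ctx, k_ctx, v_ctx, num_hosts):
    """Phase 1 (sketch): each simulated host attends only within its own block."""
    blocks = zip(np.array_split(q_ctx, num_hosts),
                 np.array_split(k_ctx, num_hosts),
                 np.array_split(v_ctx, num_hosts))
    outputs, caches = [], []
    for qb, kb, vb in blocks:
        out, _ = softmax_attention(qb, kb, vb, causal=True)
        outputs.append(out)
        caches.append((kb, vb))  # per-host KV cache kept for phase 2
    return np.concatenate(outputs), caches

def phase2_global_query(q, caches):
    """Phase 2 (sketch): query tokens attend to every host's cached tokens.

    Each host returns a partial output and a log-sum-exp per query token;
    weighting the partial outputs by softmax over the log-sum-exps gives
    the same result as attention over the concatenated caches (assumed
    merge strategy, not stated in the abstract).
    """
    outs, lses = zip(*(softmax_attention(q, k, v) for k, v in caches))
    outs = np.stack(outs)                       # (hosts, n_q, d)
    lses = np.stack(lses)                       # (hosts, n_q)
    weights = np.exp(lses - lses.max(axis=0))   # numerically stable
    weights /= weights.sum(axis=0)
    return (weights[..., None] * outs).sum(axis=0)
```

In this sketch, only one output vector and one scalar per query token would need to leave each host in phase 2, which is one way to read "minimizing communication overhead". The approximation lies in phase 1, where context tokens never see tokens held by other hosts.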
