Star Attention: Efficient LLM Inference over Long Sequences

Abstract

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
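The two phases described above can be illustrated with a minimal single-layer NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the function names (`softmax_attention`, `phase1_local`, `phase2_global`), the block size, and the use of per-block log-sum-exp statistics to merge results across hosts are all illustrative choices, and "hosts" are simulated by a plain Python loop. Any further refinements the full method may use are omitted.

```python
import numpy as np

def softmax_attention(q, k, v, mask=None):
    """Scaled dot-product attention.

    Returns the output and the log-sum-exp of the scores for each query row,
    which is enough to merge partial results from several hosts later.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    row_max = scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores - row_max)
    denom = weights.sum(axis=-1, keepdims=True)
    out = (weights / denom) @ v
    lse = (row_max + np.log(denom)).squeeze(-1)
    return out, lse

def phase1_local(q_ctx, k_ctx, v_ctx, block_size):
    """Phase 1 (sketch): each simulated host attends only within its own block."""
    outputs, kv_cache = [], []
    for start in range(0, len(q_ctx), block_size):
        block = slice(start, start + block_size)
        n = len(q_ctx[block])
        causal = np.tril(np.ones((n, n), dtype=bool))  # causal mask inside the block
        out, _ = softmax_attention(q_ctx[block], k_ctx[block], v_ctx[block], causal)
        outputs.append(out)
        kv_cache.append((k_ctx[block], v_ctx[block]))  # stays on the "host"
    return np.concatenate(outputs), kv_cache

def phase2_global(q, kv_cache):
    """Phase 2 (sketch): query tokens attend to every cached block.

    Each host returns a partial output plus its log-sum-exp; weighting the
    partial outputs by softmax over those statistics reproduces attention
    over the concatenated cache (an assumed aggregation scheme).
    """
    partial = [softmax_attention(q, k, v) for k, v in kv_cache]
    outs = np.stack([o for o, _ in partial])   # (hosts, n_query, d)
    lses = np.stack([l for _, l in partial])   # (hosts, n_query)
    host_w = np.exp(lses - lses.max(axis=0))
    host_w /= host_w.sum(axis=0)
    return (host_w[..., None] * outs).sum(axis=0)
```

In this single-layer toy the keys and values are given, so phase 2 matches global attention over the full cache exactly. In a multi-layer model the phase 1 outputs feed the next layer, so the cached keys and values in deeper layers differ from those of a globally attended model, which is where the approximation comes from.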
