LongNet: Scaling Transformers to 1,000,000,000 Tokens

Abstract

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, which restricts the maximum sequence length. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has linear computational complexity and a logarithmic dependency between tokens; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention and can be seamlessly integrated with existing Transformer-based optimizations. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
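The linear-complexity claim can be made concrete with a small counting sketch. The abstract does not spell out the mechanism, so the code below rests on assumptions: the sequence is split into fixed-length segments, every r-th token in a segment is kept, and dense attention runs only among the kept tokens. The function names and the specific segment lengths and dilation rates are hypothetical, chosen only to show that the number of query-key pairs grows in proportion to sequence length when both values grow geometrically.

```python
def dilated_indices(seq_len, segment_len, dilation):
    """Return, per segment, the positions kept when every `dilation`-th token is selected."""
    segments = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        segments.append(list(range(start, end, dilation)))
    return segments


def dilated_pair_count(seq_len, configs):
    """Count query-key pairs when dense attention runs only among the kept tokens
    of each segment, summed over all (segment_len, dilation) configurations."""
    total = 0
    for segment_len, dilation in configs:
        for kept in dilated_indices(seq_len, segment_len, dilation):
            total += len(kept) ** 2
    return total


def geometric_configs(base_segment, ratio, levels):
    """Illustrative (assumed) schedule: segment length and dilation grow by the same ratio."""
    return [(base_segment * ratio**i, ratio**i) for i in range(levels)]


if __name__ == "__main__":
    configs = geometric_configs(base_segment=16, ratio=2, levels=4)
    for n in (1024, 2048, 4096):
        # Compare the sparse pair count with the n * n pairs of dense attention.
        print(n, dilated_pair_count(n, configs), n * n)
```

Under these assumptions, doubling the sequence length doubles the pair count, whereas dense attention quadruples it. The sketch ignores causal masking, per-head offsets, and how the outputs of different configurations are combined.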
