Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

Abstract

Efficiently processing long sequences with Transformer models usually requires splitting the computation across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention and DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can further extend the possible context length at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach reduces intermediate tensor memory usage in the attention layer by as much as 87.5% for 32B Transformers, while matching previous context parallelism techniques in training speed. UPipe can support a context length of 5M tokens when training Llama3-8B on a single 8×H100 node, improving upon prior methods by over 25%.
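The abstract does not spell out the UPipe algorithm, so the following is only a minimal single-process sketch of the general property that head-level chunking relies on: attention heads are computed independently, so they can be processed a few at a time without changing the result. The function names (`attend_one_head`, `attention_all_heads`, `attention_headwise_chunked`) and the `heads_per_chunk` parameter are hypothetical and chosen for illustration. The sketch involves no accelerators or communication, and the `peak` counter is merely a stand-in for per-head intermediate buffers, not a measurement of real memory.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend_one_head(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are [seq][dim] lists."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[d] for w, v_row in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def attention_all_heads(q, k, v):
    """Baseline: every head is handled in one pass; inputs are [head][seq][dim]."""
    return [attend_one_head(qh, kh, vh) for qh, kh, vh in zip(q, k, v)]


def attention_headwise_chunked(q, k, v, heads_per_chunk):
    """Process heads in groups of `heads_per_chunk` (hypothetical parameter).

    Returns the concatenated per-head outputs and the largest number of heads
    handled in any single chunk, used here as a proxy for the size of the
    per-chunk intermediates.
    """
    out, peak = [], 0
    for start in range(0, len(q), heads_per_chunk):
        stop = min(start + heads_per_chunk, len(q))
        out.extend(attention_all_heads(q[start:stop], k[start:stop], v[start:stop]))
        peak = max(peak, stop - start)
    return out, peak


# Toy sizes, assumed purely for illustration.
random.seed(0)
H, S, D = 8, 6, 4


def rand_heads():
    return [[[random.gauss(0, 1) for _ in range(D)] for _ in range(S)]
            for _ in range(H)]


q, k, v = rand_heads(), rand_heads(), rand_heads()
full = attention_all_heads(q, k, v)
chunked, peak = attention_headwise_chunked(q, k, v, heads_per_chunk=2)
assert chunked == full  # chunking over heads leaves the output unchanged
print(f"peak heads in flight: {peak} of {H}")  # prints "peak heads in flight: 2 of 8"
```

In this toy setting, any buffer whose size scales with the number of heads in flight would shrink in proportion to `heads_per_chunk / H`. How UPipe schedules chunks across devices and overlaps them with communication is not described in the abstract and is not modeled here.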
