Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

Abstract

Efficiently processing long sequences with Transformer models usually requires splitting the computation across accelerators via context parallelism. The dominant approaches in this family of methods, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as Fully Pipelined Distributed Transformer or activation offloading, can extend the possible context length further, but at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach reduces intermediate tensor memory usage in the attention layer by as much as 87.5% for 32B Transformers, while matching previous context parallelism techniques in training speed. UPipe supports a context length of 5M tokens when training Llama3-8B on a single 8×H100 node, improving upon prior methods by over 25%.
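The idea of chunking at the attention head level can be illustrated with a toy, single-process sketch. It is not the UPipe implementation: it has no communication and no accelerator code, and the names `attention_one_head`, `headwise_chunked_attention`, `random_heads` and `heads_per_chunk` are hypothetical, introduced here only for illustration. The sketch relies on one fact, that attention heads are computed independently of each other, so processing them in groups gives the same output while only one group's temporary results are held at a time.

```python
import math
import random


def attention_one_head(q, k, v):
    """Plain softmax attention for one head; q, k, v are [seq][dim] lists."""
    dim = len(q[0])
    out = []
    for qi in q:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) / total for d in range(dim)]
        )
    return out


def headwise_chunked_attention(q, k, v, heads_per_chunk):
    """Process heads in groups of heads_per_chunk (hypothetical parameter).

    q, k, v are [heads][seq][dim] lists. Returns (output, peak_live_heads),
    where peak_live_heads is the largest number of heads whose per-chunk
    temporaries were held at once in this toy model.
    """
    num_heads = len(q)
    output = []
    peak_live_heads = 0
    for start in range(0, num_heads, heads_per_chunk):
        stop = min(start + heads_per_chunk, num_heads)
        # Only this chunk's heads are materialised as temporaries.
        chunk_out = [attention_one_head(q[h], k[h], v[h]) for h in range(start, stop)]
        peak_live_heads = max(peak_live_heads, len(chunk_out))
        output.extend(chunk_out)
    return output, peak_live_heads


def random_heads(num_heads, seq_len, dim, seed):
    """Build a small random [heads][seq][dim] tensor as nested lists."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
        for _ in range(num_heads)
    ]
```

Calling `headwise_chunked_attention` with `heads_per_chunk=1` on an 8-head input returns the same output as processing all 8 heads at once, with a peak of 1 live head instead of 8. That ratio is a property of this toy setup only and says nothing about how the 87.5% figure in the abstract was measured.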
