ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention

Abstract

Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long-sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with a 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimality of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.
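To make the phrase "the initial operator state it requires" concrete, the following is a minimal single-process sketch, not the authors' All-Scan implementation. It assumes the common unnormalized linear attention recurrence S_t = S_{t-1} + k_t^T v_t, o_t = q_t S_t (no decay or gating), which the abstract itself does not spell out. Ranks are simulated by list slices, and all function names (`linear_attention`, `exclusive_scan`, `sequence_parallel_attention`) are hypothetical. The serial `exclusive_scan` only shows what each rank must receive; it says nothing about how All-Scan communicates it.

```python
import random


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def add_outer(state, k_t, v_t):
    """In place: state += outer(k_t, v_t)."""
    for i, k_i in enumerate(k_t):
        for j, v_j in enumerate(v_t):
            state[i][j] += k_i * v_j


def linear_attention(q, k, v, init_state=None):
    """Causal, unnormalized linear attention over one chunk (assumed form).

    init_state is the d_k x d_v sum of outer(k_s, v_s) over all earlier tokens.
    Returns (outputs, final_state).
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [row[:] for row in init_state] if init_state is not None else zeros(d_k, d_v)
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        add_outer(state, k_t, v_t)  # S_t = S_{t-1} + k_t^T v_t
        # o_t = q_t S_t
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out, state


def exclusive_scan(states):
    """Serial stand-in for a collective: rank r gets the sum of states of ranks < r."""
    running = zeros(len(states[0]), len(states[0][0]))
    initial = []
    for s in states:
        initial.append([row[:] for row in running])
        running = [[a + b for a, b in zip(r_row, s_row)] for r_row, s_row in zip(running, s)]
    return initial


def sequence_parallel_attention(q, k, v, num_ranks):
    """Split the sequence into equal chunks, one per simulated rank."""
    chunk = len(q) // num_ranks  # assumes len(q) is divisible by num_ranks
    spans = [slice(r * chunk, (r + 1) * chunk) for r in range(num_ranks)]
    # Each rank first summarizes its own chunk as a local d_k x d_v state.
    local = []
    for s in spans:
        state = zeros(len(k[0]), len(v[0]))
        for k_t, v_t in zip(k[s], v[s]):
            add_outer(state, k_t, v_t)
        local.append(state)
    # Each rank then needs only the combined state of the ranks before it.
    initial = exclusive_scan(local)
    out = []
    for s, s0 in zip(spans, initial):
        chunk_out, _ = linear_attention(q[s], k[s], v[s], s0)
        out.extend(chunk_out)
    return out


rng = random.Random(0)


def rand_rows(n, d):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]


q, k, v = rand_rows(16, 3), rand_rows(16, 3), rand_rows(16, 2)
reference, _ = linear_attention(q, k, v)
parallel = sequence_parallel_attention(q, k, v, num_ranks=4)
max_diff = max(abs(a - b) for r, p in zip(reference, parallel) for a, b in zip(r, p))
print(max_diff)
```

Under these assumptions, the chunked result should agree with the single-device result up to floating-point rounding. The point of the sketch is that the state passed between ranks has a fixed size (d_k by d_v) that does not grow with sequence length, which is why a scan-style collective over that state is a natural fit for linear attention.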
