Breaking the Bubble: Asynchronous Pipeline-Parallel Training with Bounded Weight Inconsistency

Abstract

Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce a weight-version mismatch, which typically requires weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to the pipeline delay, PACI limits the number of optimizer updates that any micro-batch crosses between its forward and backward passes, while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to 1.69× over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
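
The key idea, gradient accumulation as a limit on version drift, can be illustrated with a small toy model. The sketch below is not the PACI schedule itself; it is a simplified single-stage model under stated assumptions: a micro-batch's forward pass sees the gradients of all micro-batches up to `delay` positions behind it, its backward pass runs after all earlier micro-batches have contributed their gradients, and the optimizer steps once every `accum_steps` gradient contributions. The names `delay`, `accum_steps`, `version_drift`, and `max_drift` are hypothetical and chosen only for illustration.

```python
from math import ceil


def version_drift(delay: int, accum_steps: int, num_microbatches: int) -> list[int]:
    """Return, per micro-batch, how many optimizer updates occur between its
    forward and backward pass at one stage of a toy asynchronous pipeline.

    Assumptions (illustrative, not taken from the paper):
      * the forward pass of micro-batch m sees max(0, m - delay) accumulated
        gradient contributions;
      * the backward pass of micro-batch m sees m contributions;
      * the weight version advances once per accum_steps contributions.
    """
    drifts = []
    for m in range(num_microbatches):
        contributions_at_forward = max(0, m - delay)
        version_at_forward = contributions_at_forward // accum_steps
        version_at_backward = m // accum_steps
        drifts.append(version_at_backward - version_at_forward)
    return drifts


def max_drift(delay: int, accum_steps: int, num_microbatches: int) -> int:
    """Largest forward/backward version gap over all micro-batches."""
    return max(version_drift(delay, accum_steps, num_microbatches))


if __name__ == "__main__":
    delay = 6  # hypothetical pipeline delay, in micro-batches
    for accum_steps in (1, 3, 8):
        observed = max_drift(delay, accum_steps, num_microbatches=64)
        bound = ceil(delay / accum_steps)
        print(f"accum_steps={accum_steps}: max drift={observed}, bound={bound}")
```

In this toy model the drift never exceeds `ceil(delay / accum_steps)`, so accumulating over at least `delay` micro-batches keeps every micro-batch within one optimizer update of the weights it used in its forward pass, without storing any extra weight copies.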