Breaking the Bubble: Asynchronous Pipeline-Parallel Training with Bounded Weight Inconsistency

Abstract

Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to 1.69× over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
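
To make the version-control intuition concrete, the following is a minimal toy model, not the PACI schedule itself. It assumes a single pipeline stage in steady state, where the forward pass of a micro-batch runs `delay` backward passes before its own backward pass, and where the stage applies one optimizer update after every `accum_steps` backward passes. The function names and the parameters `delay` and `accum_steps` are hypothetical and chosen for illustration only. Under these assumptions, the number of optimizer updates a micro-batch crosses between its forward and backward pass is at most `ceil(delay / accum_steps)`.

```python
def version_drift(num_microbatches: int, delay: int, accum_steps: int) -> list[int]:
    """Toy model of forward/backward weight-version drift at one stage.

    Assumptions (illustrative, not taken from the paper):
      * forward of micro-batch i runs once max(0, i - delay) backward
        passes have completed at this stage;
      * backward of micro-batch i runs once i backward passes have completed;
      * the weight version increases by one after every `accum_steps`
        backward passes (local gradient accumulation).
    """
    drifts = []
    for i in range(num_microbatches):
        backward_done_at_forward = max(0, i - delay)
        backward_done_at_backward = i
        version_at_forward = backward_done_at_forward // accum_steps
        version_at_backward = backward_done_at_backward // accum_steps
        drifts.append(version_at_backward - version_at_forward)
    return drifts


def max_drift(num_microbatches: int, delay: int, accum_steps: int) -> int:
    """Largest number of optimizer updates crossed by any micro-batch."""
    return max(version_drift(num_microbatches, delay, accum_steps))


if __name__ == "__main__":
    # Hypothetical stage with 6 micro-batches in flight.
    for k in (1, 3, 8):
        print(f"accum_steps={k}: max drift = {max_drift(64, 6, k)}")
```

In this toy model, updating after every micro-batch (`accum_steps=1`) lets the drift grow to the full pipeline delay, whereas accumulating over at least as many micro-batches as the delay keeps it at one update or fewer. This is the sense in which slowing parameter-version evolution relative to pipeline delay bounds the inconsistency without storing extra weight copies.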