2BP: 2-Stage Backpropagation

Abstract

As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, so their parameters must be sharded across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.
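To make the idea of a split backward pass concrete, the following is a minimal sketch for a single linear layer in plain Python. The abstract does not say what the two stages are; the sketch assumes the first stage computes the gradient with respect to the layer input (the value the previous pipeline stage is waiting on) and the second stage computes the weight gradients later. The names `TwoStageLinear`, `backward_p1`, and `backward_p2` are hypothetical and are not taken from the paper or from any framework.

```python
# Illustrative sketch only: a linear layer y = W x with a two-stage backward pass.
# Assumption: stage 1 = input gradient, stage 2 = deferred weight gradient.
# All names here are hypothetical.

class TwoStageLinear:
    def __init__(self, weight):
        self.weight = weight                      # list of rows, shape [out][in]
        self.grad_weight = [[0.0] * len(weight[0]) for _ in weight]
        self._inputs = []                         # inputs saved by forward()
        self._pending = []                        # (input, grad_output) pairs for stage 2

    def forward(self, x):
        # Save the input; the weight gradient will need it later.
        self._inputs.append(x)
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]

    def backward_p1(self, grad_out):
        # Stage 1: gradient w.r.t. the input, W^T * grad_out.
        # Assumes micro-batches come back in the order they went forward.
        x = self._inputs.pop(0)
        self._pending.append((x, grad_out))
        n_in = len(self.weight[0])
        return [
            sum(self.weight[o][i] * grad_out[o] for o in range(len(grad_out)))
            for i in range(n_in)
        ]

    def backward_p2(self):
        # Stage 2: accumulate weight gradients, grad_out * x^T, for every
        # micro-batch queued by stage 1. Nothing upstream depends on this,
        # so it can be scheduled into otherwise idle time.
        for x, grad_out in self._pending:
            for o, go in enumerate(grad_out):
                for i, xi in enumerate(x):
                    self.grad_weight[o][i] += go * xi
        self._pending.clear()


layer = TwoStageLinear([[1.0, 2.0], [3.0, 4.0]])
y = layer.forward([1.0, 1.0])
grad_in = layer.backward_p1([1.0, 0.5])   # can be sent upstream immediately
layer.backward_p2()                       # weight gradients filled in afterwards
```

Under this assumption, the input gradient can be handed to the preceding pipeline stage as soon as `backward_p1` returns, while the weight-gradient work in `backward_p2` is postponed until the accelerator would otherwise sit idle.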
