
2BP: 2-Stage Backpropagation

May 28, 2024
Authors: Christopher Rae, Joseph K. L. Lee, James Richings
cs.AI

Abstract

As Deep Neural Networks (DNNs) grow in size and complexity, they often exceed the memory capacity of a single accelerator, necessitating the sharding of model parameters across multiple accelerators. Pipeline parallelism is a commonly used sharding strategy for training large DNNs. However, current implementations of pipeline parallelism are being unintentionally bottlenecked by the automatic differentiation tools provided by ML frameworks. This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. We tested 2BP on various model architectures and pipelining schedules, achieving increases in throughput in all cases. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods when training a LLaMa-like transformer with 7 billion parameters across 4 GPUs.
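As a rough illustration of the idea described in the abstract (not the paper's implementation), the sketch below splits a toy linear layer's backward pass into two stages: an input-gradient stage that immediately unblocks the previous pipeline stage, and a weight-gradient stage that can be deferred to otherwise idle time. The class name `TwoStageLinear` and the method names `backward_p1`/`backward_p2` are illustrative assumptions.

```python
import torch

# Sketch only: for a linear layer y = x @ W.T, the backward pass factors into
#   grad_x = grad_y @ W      (needed right away by the previous pipeline stage)
#   grad_W = grad_y.T @ x    (can be postponed to fill idle compute time)

class TwoStageLinear:
    """Toy linear layer whose backward pass is split into two stages."""

    def __init__(self, in_features, out_features):
        self.W = torch.randn(out_features, in_features)
        self.grad_W = torch.zeros_like(self.W)
        self._deferred = []  # (grad_output, input) pairs awaiting stage 2

    def forward(self, x):
        self._saved_input = x
        return x @ self.W.T

    def backward_p1(self, grad_output):
        # Stage 1: propagate the gradient to the previous stage immediately.
        self._deferred.append((grad_output, self._saved_input))
        return grad_output @ self.W

    def backward_p2(self):
        # Stage 2: accumulate weight gradients later, when the accelerator
        # would otherwise sit idle in the pipeline schedule.
        for grad_output, x in self._deferred:
            self.grad_W += grad_output.T @ x
        self._deferred.clear()
```

In a pipeline schedule, each stage would call `backward_p1` as soon as the downstream gradient arrives and batch the `backward_p2` calls into gaps in the schedule; this separation is what the abstract credits with reducing idle compute time.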

