PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
March 3, 2025
Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Jialin Li, Min Lin
cs.AI
Abstract
Pipeline parallelism (PP) is widely used for training large language models
(LLMs), yet its scalability is often constrained by high activation memory
consumption, as the number of in-flight microbatches grows with the degree of
PP. In this paper, we focus on addressing this challenge by leveraging the
under-explored memory offload strategy in PP. Through empirical study, we
discover that in the majority of standard configurations, at least half, and
potentially all, of the activations can be offloaded with negligible overhead.
In cases where full offload is not possible, we introduce a novel selective
offload strategy that decreases peak activation memory in a better-than-linear
manner. Furthermore, we integrate memory offload with other techniques to
jointly consider overall throughput and memory limitation. Our experiments
show that per-device activation memory decreases effectively with the total
number of stages, making PP a stronger alternative than tensor parallelism
(TP), offering up to a 19% acceleration with even lower memory consumption.
The implementation is open-sourced at
https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
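
To make the offload mechanism concrete, below is a minimal PyTorch sketch of threshold-based selective activation offloading built on saved-tensor hooks. It is not the paper's implementation (see the repository above for that): the size threshold that decides which activations leave the GPU is an illustrative assumption standing in for the paper's selective strategy, and the non-blocking copies through pinned memory hint at how transfers can overlap with compute to keep overhead small.

import torch

# Illustrative cutoff, not the paper's selection rule: offload only saved
# tensors of at least 1 MiB and keep small tensors on the GPU.
OFFLOAD_THRESHOLD_BYTES = 1 << 20

def pack_to_cpu(tensor):
    # Pack hook: called when autograd saves a tensor for backward. For
    # simplicity this treats saved parameters the same way as activations.
    if tensor.is_cuda and tensor.numel() * tensor.element_size() >= OFFLOAD_THRESHOLD_BYTES:
        cpu_copy = torch.empty(tensor.size(), dtype=tensor.dtype, pin_memory=True)
        # Non-blocking copies to/from pinned memory can overlap with
        # compute, which is what keeps the offload overhead low.
        cpu_copy.copy_(tensor, non_blocking=True)
        return ("cpu", cpu_copy, tensor.device)
    return ("gpu", tensor, None)

def unpack_from_cpu(packed):
    # Unpack hook: reload the tensor onto its original GPU for backward.
    where, tensor, device = packed
    if where == "cpu":
        return tensor.to(device, non_blocking=True)
    return tensor

if torch.cuda.is_available():
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
    ).cuda()
    x = torch.randn(8, 4096, device="cuda", requires_grad=True)
    # Every tensor saved for backward inside this context goes through the hooks.
    with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
        loss = model(x).sum()
    loss.backward()  # offloaded tensors stream back from CPU as needed

For the full-offload case the abstract mentions, PyTorch's built-in torch.autograd.graph.save_on_cpu(pin_memory=True) context realizes the same hook pair without any selection threshold.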