PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Abstract

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, because the number of in-flight microbatches grows with the degree of PP. In this paper, we address this challenge by leveraging the under-explored memory offload strategy in PP. Through an empirical study, we find that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limits. Our experiments show that per-device activation memory effectively decreases as the total number of stages grows, making PP a stronger alternative to tensor parallelism (TP), with up to a 19% speedup at even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
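
The memory constraint described above can be illustrated with a rough back-of-the-envelope model. The sketch below is not taken from the paper's implementation. It assumes a standard 1F1B schedule, in which stage i (0-indexed) of p stages holds activations for up to p - i in-flight microbatches at its peak, and it assumes layers are split evenly across stages. The function names and the example sizes (32 layers, 512 MB of activations per layer per microbatch) are hypothetical.

```python
def peak_inflight_microbatches(num_stages: int, stage: int) -> int:
    """In-flight microbatches at the peak on `stage` (0-indexed), assuming 1F1B."""
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    return num_stages - stage


def peak_activation_mb(num_stages: int, num_layers: int,
                       per_layer_activation_mb: int, stage: int = 0) -> int:
    """Peak activation memory on one stage with no offload (illustrative model).

    Assumes layers divide evenly across stages and that every in-flight
    microbatch keeps the activations of all layers on the stage resident.
    """
    if num_layers % num_stages != 0:
        raise ValueError("this sketch assumes num_layers is divisible by num_stages")
    layers_per_stage = num_layers // num_stages
    inflight = peak_inflight_microbatches(num_stages, stage)
    return inflight * layers_per_stage * per_layer_activation_mb


if __name__ == "__main__":
    # Hypothetical sizes: 32 layers, 512 MB of activations per layer per microbatch.
    for p in (4, 8, 16):
        print(p, peak_activation_mb(p, num_layers=32, per_layer_activation_mb=512))
```

Under these assumptions the first stage's peak stays the same as p grows: each stage holds fewer layers, but proportionally more microbatches are in flight. Offloading activations is aimed at exactly this term.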
