Balancing Pipeline Parallelism with Vocabulary Parallelism

Abstract

Pipeline parallelism is widely used to scale the training of transformer-based large language models, and many works have sought to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms that reduce communication barriers within vocabulary layers. Additionally, we use a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation-memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5% to 51% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially in large-vocabulary scenarios. Our implementation is open-sourced at https://github.com/sail-sg/VocabularyParallelism .
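To make the idea of partitioning a vocabulary layer across devices concrete, the following is a minimal sketch in plain Python. It is not the implementation from the repository above, and it is not one of the reduced-barrier algorithms the abstract mentions, which are not spelled out here. It shows only the textbook formulation: each simulated device holds a contiguous shard of the output embedding, computes its local logits, and three reductions (global max, global sum of exponentials, target logit) recover the same cross-entropy loss as the unpartitioned layer. All function names are hypothetical, devices are simulated with list slices, the reductions are ordinary Python `max` and `sum` calls standing in for all-reduce collectives, and the vocabulary is assumed to have at least as many rows as there are devices.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def full_cross_entropy(hidden, weight, target):
    """Reference: cross-entropy over the whole vocabulary on one device."""
    logits = [dot(hidden, row) for row in weight]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def split_vocab(weight, num_devices):
    """Split vocabulary rows into contiguous shards, as evenly as possible.

    Returns a list of (start_index, rows) pairs, one per simulated device.
    Assumes len(weight) >= num_devices so that no shard is empty.
    """
    base, extra = divmod(len(weight), num_devices)
    shards, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < extra else 0)
        shards.append((start, weight[start:start + size]))
        start += size
    return shards


def sharded_cross_entropy(hidden, shards, target):
    """Cross-entropy with the vocabulary split across simulated devices."""
    # Each device computes logits only for its own slice of the vocabulary.
    local = [(start, [dot(hidden, row) for row in rows]) for start, rows in shards]

    # Reduction 1 (stand-in for an all-reduce with MAX): global maximum logit.
    global_max = max(max(logits) for _, logits in local)

    # Reduction 2 (stand-in for an all-reduce with SUM): softmax denominator.
    global_sum = sum(
        sum(math.exp(x - global_max) for x in logits) for _, logits in local
    )

    # Reduction 3 (stand-in for an all-reduce with SUM): only the shard that
    # owns the target token contributes its logit; the others contribute 0.
    target_logit = sum(
        logits[target - start]
        for start, logits in local
        if start <= target < start + len(logits)
    )
    return global_max + math.log(global_sum) - target_logit


# Toy example: hidden size 3, vocabulary of 10 tokens, 4 simulated devices.
hidden = [0.5, -1.0, 2.0]
weight = [[0.1 * (i + j) for j in range(3)] for i in range(10)]
shards = split_vocab(weight, 4)
shard_sizes = [len(rows) for _, rows in shards]
loss_full = full_cross_entropy(hidden, weight, target=7)
loss_sharded = sharded_cross_entropy(hidden, shards, target=7)
```

In this toy setup the ten vocabulary rows are split into shards of sizes 3, 3, 2, and 2, and `loss_sharded` matches `loss_full` up to floating-point rounding. Each reduction in the sketch marks a point where real devices would have to synchronize; those synchronization points are the kind of communication barrier the abstract aims to reduce.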

