T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

Abstract

Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hiding this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation is difficult to achieve in software. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires only minor software changes. At the hardware level, T3 adds a lightweight track and trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in ~500-billion parameter models, PALM and MT-NLG.
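The track and trigger idea described in the abstract can be illustrated with a toy software analogy. The sketch below is not the T3 hardware design: it assumes a producer that writes its output tile by tile, and a tracker that counts writes per tile and fires a "send" as soon as a tile is complete. All names (`TrackAndTrigger`, `run_producer`, `writes_per_tile`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class TrackAndTrigger:
    """Toy model of a track-and-trigger unit (hypothetical, not T3's design).

    Counts the writes that land in each output tile and invokes a callback
    once a tile has received all of its expected writes.
    """

    def __init__(self, writes_per_tile, on_tile_ready):
        self.writes_per_tile = writes_per_tile
        self.on_tile_ready = on_tile_ready
        self.counts = defaultdict(int)

    def record_write(self, tile_id):
        # Track: one more write has reached this tile.
        self.counts[tile_id] += 1
        # Trigger: the tile is complete, so its communication can start.
        if self.counts[tile_id] == self.writes_per_tile:
            self.on_tile_ready(tile_id)


def run_producer(num_tiles, writes_per_tile):
    """Simulate a producer writing tiles in order and return the event log."""
    log = []
    tracker = TrackAndTrigger(
        writes_per_tile,
        on_tile_ready=lambda tile: log.append(("send", tile)),
    )
    for tile in range(num_tiles):
        for _ in range(writes_per_tile):
            log.append(("write", tile))
            tracker.record_write(tile)
    return log


events = run_producer(num_tiles=3, writes_per_tile=2)
for event in events:
    print(event)
```

In this sketch, the send for tile 0 is logged before any write to tile 1, which is the fine-grained interleaving the abstract refers to. The model ignores resource contention and real transfer times, both of which the abstract identifies as central to the actual design.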


