T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
January 30, 2024
Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
cs.AI
Abstract
Large Language Models increasingly rely on distributed techniques for their
training and inference. These techniques require communication across devices
which can reduce scaling efficiency as the number of devices increases. While
some distributed techniques can overlap, and thus hide, this communication with
independent computations, techniques such as Tensor Parallelism (TP) inherently
serialize communication with model execution. One approach to hide this
serialized communication is to interleave it with the producer operation (of
the communicated data) in a fine-grained manner. However, this fine-grained
interleaving of communication and computation in software can be difficult.
Furthermore, as with any concurrent execution, it requires compute and memory
resources to be shared between computation and communication, causing resource
contention that reduces overlapping efficacy.
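
To make the fine-grained interleaving described above concrete, the following minimal sketch (our illustration, not the paper's implementation) chunks a tensor-parallel producer GEMM along its row dimension and launches an asynchronous all-reduce on each chunk as soon as it is computed, so later chunks' compute overlaps with earlier chunks' communication. The chunk count, shapes, and use of torch.distributed are illustrative assumptions.

# Illustrative sketch of fine-grained compute/communication interleaving in
# software; assumes torch.distributed has already been initialized.
import torch
import torch.distributed as dist

def fused_gemm_allreduce(activations, weight_shard, num_chunks=4):
    # Compute (activations @ weight_shard) chunk by chunk along the row
    # dimension; start an async all-reduce for each finished chunk so its
    # communication overlaps with the compute of later chunks.
    outputs, handles = [], []
    for act_chunk in activations.chunk(num_chunks, dim=0):
        out_chunk = act_chunk @ weight_shard                       # producer compute for one chunk
        handles.append(dist.all_reduce(out_chunk, async_op=True))  # start its communication
        outputs.append(out_chunk)
    for handle in handles:                                         # drain outstanding communication
        handle.wait()
    return torch.cat(outputs, dim=0)

Even in this software form, the GEMM and all-reduce kernels share the device's compute and memory bandwidth, which is precisely the resource contention discussed above.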
To overcome these challenges, we propose T3 which applies hardware-software
co-design to transparently overlap serialized communication while minimizing
resource contention with compute. T3 transparently fuses producer operations
with the subsequent communication via a simple configuration of the producer's
output address space and requires minor software changes. At the hardware
level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the
producer's compute and communication. It further uses compute-enhanced
memories for communication's attendant compute. As a result, T3 reduces
resource contention and efficiently overlaps serialized communication with
computation. For important Transformer models like T-NLG, T3 speeds up
communication-heavy sublayers by 30% geomean (max 47%) and reduces data
movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models
scale: geomean 29% for sublayers in ~500-billion parameter models, PALM
and MT-NLG.
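
As a rough conceptual analogue of the track-and-trigger behavior described above (not T3's hardware design; all names and granularities below are illustrative assumptions), one can think of the producer's configured output address space as a set of regions whose writes are counted, with communication for a region launched as soon as it is fully produced:

# Conceptual software analogue of a track-and-trigger mechanism: count writes
# into each configured region of the producer's output and trigger that
# region's communication once the region is fully written.
class TrackAndTrigger:
    def __init__(self, region_sizes, on_region_complete):
        self.remaining = list(region_sizes)         # outstanding elements per region
        self.on_region_complete = on_region_complete

    def record_writes(self, region_id, num_elements):
        # Called as the producer stores num_elements results into region_id.
        self.remaining[region_id] -= num_elements
        if self.remaining[region_id] <= 0:          # region fully produced
            self.on_region_complete(region_id)      # e.g., start its reduction

# Example: four regions of 1024 elements; the trigger just prints here instead
# of launching communication.
tracker = TrackAndTrigger([1024] * 4, lambda r: print(f"trigger communication for region {r}"))
tracker.record_writes(0, 1024)

In T3, this tracking and triggering happens in lightweight hardware alongside the producer's stores, and the communication's attendant reductions are offloaded to compute-enhanced memories, which is what keeps contention with the producer's compute low.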