T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
January 30, 2024
Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
cs.AI
Abstract
Large Language Models increasingly rely on distributed techniques for their
training and inference. These techniques require communication across devices
which can reduce scaling efficiency as the number of devices increases. While
some distributed techniques can overlap, and thus hide, this communication with
independent computations, techniques such as Tensor Parallelism (TP) inherently
serialize communication with model execution. One approach to hide this
serialized communication is to interleave it with the producer operation (of
the communicated data) in a fine-grained manner. However, this fine-grained
interleaving of communication and computation in software can be difficult.
Furthermore, as with any concurrent execution, it requires compute and memory
resources to be shared between computation and communication, causing resource
contention that reduces overlapping efficacy.
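
To make the fine-grained interleaving described above concrete, the following minimal sketch (our illustration, not the paper's implementation) chunks a tensor-parallel producer GEMM along its row dimension and launches an asynchronous all-reduce on each chunk as soon as it is computed, so later chunks' compute overlaps with earlier chunks' communication. The chunk count, shapes, and use of torch.distributed are illustrative assumptions.

# Illustrative sketch of fine-grained compute/communication interleaving in
# software; assumes torch.distributed has already been initialized.
import torch
import torch.distributed as dist

def fused_gemm_allreduce(activations, weight_shard, num_chunks=4):
    # Compute (activations @ weight_shard) chunk by chunk along the row
    # dimension; start an async all-reduce for each finished chunk so its
    # communication overlaps with the compute of later chunks.
    outputs, handles = [], []
    for act_chunk in activations.chunk(num_chunks, dim=0):
        out_chunk = act_chunk @ weight_shard                       # producer compute for one chunk
        handles.append(dist.all_reduce(out_chunk, async_op=True))  # start its communication
        outputs.append(out_chunk)
    for handle in handles:                                         # drain outstanding communication
        handle.wait()
    return torch.cat(outputs, dim=0)

Even in this software form, the GEMM and all-reduce kernels share the device's compute and memory bandwidth, which is precisely the resource contention discussed above.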
To overcome these challenges, we propose T3 which applies hardware-software
co-design to transparently overlap serialized communication while minimizing
resource contention with compute. T3 transparently fuses producer operations
with the subsequent communication via a simple configuration of the producer's
output address space and requires minor software changes. At the hardware
level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the
producer's compute and communication. It further uses compute-enhanced
memories for communication's attendant compute. As a result, T3 reduces
resource contention and efficiently overlaps serialized communication with
computation. For important Transformer models like T-NLG, T3 speeds up
communication-heavy sublayers by 30% geomean (max 47%) and reduces data
movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models
scale: geomean 29% for sublayers in ~500-billion parameter models, PALM
and MT-NLG.
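
As a rough conceptual analogue of the track-and-trigger behavior described above (not T3's hardware design; all names and granularities below are illustrative assumptions), one can think of the producer's configured output address space as a set of regions whose writes are counted, with communication for a region launched as soon as it is fully produced:

# Conceptual software analogue of a track-and-trigger mechanism: count writes
# into each configured region of the producer's output and trigger that
# region's communication once the region is fully written.
class TrackAndTrigger:
    def __init__(self, region_sizes, on_region_complete):
        self.remaining = list(region_sizes)         # outstanding elements per region
        self.on_region_complete = on_region_complete

    def record_writes(self, region_id, num_elements):
        # Called as the producer stores num_elements results into region_id.
        self.remaining[region_id] -= num_elements
        if self.remaining[region_id] <= 0:          # region fully produced
            self.on_region_complete(region_id)      # e.g., start its reduction

# Example: four regions of 1024 elements; the trigger just prints here instead
# of launching communication.
tracker = TrackAndTrigger([1024] * 4, lambda r: print(f"trigger communication for region {r}"))
tracker.record_writes(0, 1024)

In T3, this tracking and triggering happens in lightweight hardware alongside the producer's stores, and the communication's attendant reductions are offloaded to compute-enhanced memories, which is what keeps contention with the producer's compute low.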