T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
January 30, 2024
Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair
cs.AI
Abstract
Large Language Models increasingly rely on distributed techniques for their
training and inference. These techniques require communication across devices
which can reduce scaling efficiency as the number of devices increases. While
some distributed techniques can overlap, and thus hide, this communication with
independent computations, techniques such as Tensor Parallelism (TP) inherently
serialize communication with model execution. One approach to hide this
serialized communication is to interleave it with the producer operation (of
the communicated data) in a fine-grained manner. However, this fine-grained
interleaving of communication and computation in software can be difficult.
Furthermore, as with any concurrent execution, it requires compute and memory
resources to be shared between computation and communication, causing resource
contention that reduces overlapping efficacy.
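To make the serialized communication and the fine-grained software interleaving concrete, the sketch below (an illustration only, assuming an initialized torch.distributed process group and illustrative tensor shapes; it is not code from the paper) contrasts a baseline tensor-parallel sublayer, where the all-reduce waits for the entire producer GEMM, with a manual chunk-wise overlap of the GEMM and its collective.

```python
import torch
import torch.distributed as dist

def baseline_sublayer(x, w):
    # Producer GEMM: each rank holds a shard of w, so y_partial is a partial sum.
    y_partial = x @ w
    # Serialized communication: the all-reduce cannot start until the whole GEMM
    # finishes, and the consumer cannot start until the all-reduce finishes.
    dist.all_reduce(y_partial)
    return y_partial

def chunked_overlap_sublayer(x, w, num_chunks=4):
    # Manual fine-grained overlap: split the producer GEMM along the row dimension
    # and all-reduce each chunk asynchronously while later chunks are computed.
    chunks = x.chunk(num_chunks, dim=0)
    outputs, handles = [], []
    for xc in chunks:
        yc = xc @ w
        handles.append(dist.all_reduce(yc, async_op=True))
        outputs.append(yc)
    for h in handles:
        h.wait()
    return torch.cat(outputs, dim=0)
```

Splitting the GEMM this way exposes overlap, but the smaller per-chunk GEMMs are typically less efficient, and the collective now competes with the GEMM for compute units and memory bandwidth, which is the resource contention described above.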
To overcome these challenges, we propose T3 which applies hardware-software
co-design to transparently overlap serialized communication while minimizing
resource contention with compute. T3 transparently fuses producer operations
with the subsequent communication via a simple configuration of the producer's
output address space and requires minor software changes. At the hardware
level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the
producer's compute and communication. It further uses compute-enhanced
memories for communication's attendant compute. As a result, T3 reduces
resource contention, and efficiently overlaps serialized communication with
computation. For important Transformer models like T-NLG, T3 speeds up
communication-heavy sublayers by 30% geomean (max 47%) and reduces data
movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models
scale: geomean 29% for sublayers in ~500-billion parameter models, PALM
and MT-NLG.
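The track-and-trigger idea can be pictured with a small functional model (purely illustrative; T3 implements this in hardware near memory, and the class and method names below are hypothetical): a tracker is configured with the producer's output address space divided into communication-granularity blocks, counts stores into each block, and triggers that block's transfer/reduction as soon as it is fully written, instead of waiting for the whole producer operation to finish.

```python
from dataclasses import dataclass, field

@dataclass
class TrackTriggerModel:
    """Toy software model of a track-and-trigger unit (hypothetical names)."""
    block_size: int   # stores (elements) per communication-granularity block
    num_blocks: int   # number of blocks in the tracked output address space
    writes: list = field(init=False)
    triggered: list = field(init=False)

    def __post_init__(self):
        self.writes = [0] * self.num_blocks
        self.triggered = [False] * self.num_blocks

    def on_store(self, addr: int) -> None:
        """Invoked for each store into the tracked output region."""
        blk = addr // self.block_size
        self.writes[blk] += 1
        if self.writes[blk] == self.block_size and not self.triggered[blk]:
            self.triggered[blk] = True
            self.trigger_communication(blk)

    def trigger_communication(self, blk: int) -> None:
        # In T3 this would start the transfer/reduction for the finished block,
        # with the reduction offloaded to compute-enhanced memory rather than
        # the GPU's compute units.
        print(f"block {blk} complete -> start transfer/reduce")

# Usage: a 2-block output region written element by element by the producer.
model = TrackTriggerModel(block_size=4, num_blocks=2)
for addr in range(8):
    model.on_store(addr)
```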