T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

Abstract

Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy. To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space and requires minor software changes. At the hardware level, T3 adds a lightweight track and trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30% geomean (max 47%) and reduces data movement by 22% geomean (max 36%). Furthermore, T3's benefits persist as models scale: geomean 29% for sublayers in ~500-billion parameter models, PaLM and MT-NLG.
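The track-and-trigger idea described above can be illustrated with a small software analogy. The sketch below is not the T3 hardware design: it is a minimal Python simulation under stated assumptions. The names `Tracker`, `matmul_row`, and `overlapped_all_reduce` are hypothetical, the chunk granularity (one output row) is chosen only for illustration, and the in-place accumulation into `reduced` merely stands in for the reduction that the abstract assigns to compute-enhanced memories. Each simulated device holds one shard of a tensor-parallel GEMM and produces a partial output; a per-device tracker counts writes to each output row and triggers communication of that row as soon as it is complete, rather than waiting for the whole GEMM to finish.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]


def matmul_row(x_row: List[float], w: Matrix) -> List[float]:
    """Compute one output row of x @ w."""
    return [sum(x_row[k] * w[k][j] for k in range(len(w)))
            for j in range(len(w[0]))]


class Tracker:
    """Hypothetical tracker: counts writes per output chunk and fires
    a trigger once a chunk has received all of its expected writes."""

    def __init__(self, writes_per_chunk: int, trigger: Callable[[int], None]):
        self.writes_per_chunk = writes_per_chunk
        self.trigger = trigger
        self.counts: Dict[int, int] = {}

    def record_write(self, chunk: int) -> None:
        self.counts[chunk] = self.counts.get(chunk, 0) + 1
        if self.counts[chunk] == self.writes_per_chunk:
            self.trigger(chunk)  # chunk is complete: start communicating it


def overlapped_all_reduce(
    x_shards: List[Matrix], w_shards: List[Matrix]
) -> Tuple[Matrix, List[Tuple[str, int, int]]]:
    """Simulate tensor-parallel GEMMs whose partial outputs are reduced
    row by row, as soon as each row has been fully written."""
    num_rows = len(x_shards[0])
    num_cols = len(w_shards[0][0])
    num_devs = len(x_shards)
    partial = [[[0.0] * num_cols for _ in range(num_rows)]
               for _ in range(num_devs)]
    reduced = [[0.0] * num_cols for _ in range(num_rows)]
    events: List[Tuple[str, int, int]] = []

    def make_trigger(dev: int) -> Callable[[int], None]:
        def trigger(row: int) -> None:
            events.append(("send", dev, row))
            # Stand-in for the reduction done near memory in the abstract.
            for j in range(num_cols):
                reduced[row][j] += partial[dev][row][j]
        return trigger

    trackers = [Tracker(num_cols, make_trigger(d)) for d in range(num_devs)]

    # Visit devices round-robin per row to mimic concurrent producers.
    for row in range(num_rows):
        for dev in range(num_devs):
            events.append(("compute", dev, row))
            out = matmul_row(x_shards[dev][row], w_shards[dev])
            for j, value in enumerate(out):
                partial[dev][row][j] = value
                trackers[dev].record_write(row)  # tracker observes each write
    return reduced, events


if __name__ == "__main__":
    # A 2x4 input and a 4x2 weight, split along the inner dimension.
    x_shards = [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
    w_shards = [[[1, 0], [0, 1]], [[1, 1], [2, 0]]]
    result, log = overlapped_all_reduce(x_shards, w_shards)
    print(result)
    for event in log:
        print(event)
```

In the event log, the "send" for a row appears before the "compute" of the next row on the same device, which is the interleaving the abstract aims for. The producer loop itself contains no communication calls; it only writes outputs, and the tracker decides when to communicate. This loosely mirrors the claim that the fusion is transparent to the producer.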
