RDMA Point-to-Point Communication for LLM Systems
October 31, 2025
Authors: Nandor Licker, Kevin Hu, Vladimir Zaytsev, Lequn Chen
cs.AI
Abstract
Emerging Large Language Model (LLM) system patterns, such as disaggregated
inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement
fine-tuning, require flexible point-to-point communication beyond simple
collectives. Existing implementations are locked to specific Network Interface
Controllers (NICs), hindering integration into inference engines and
portability across hardware providers. We present TransferEngine, which bridges
the functionality of common NICs to expose a uniform interface. TransferEngine
exposes one-sided WriteImm operations with an ImmCounter primitive for
completion notification, makes no ordering assumptions about the network
transport, and transparently manages multiple NICs per GPU. We demonstrate peak throughput of
400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We
showcase TransferEngine through three production systems: (1) KvCache transfer
for disaggregated inference with dynamic scaling, (2) RL weight updates
that complete in 1.3 seconds for trillion-parameter models, and (3) an MoE
dispatch/combine implementation that outperforms DeepEP in decode latency on
ConnectX-7 and achieves the first viable latencies on EFA. These results show that our portable
point-to-point communication complements collectives while avoiding lock-in.
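
To illustrate how counter-based completion can work without ordering guarantees, the following is a minimal toy sketch in Python. It is not the TransferEngine API: the class layout, the method names `on_write_imm` and `wait`, and the helper `simulate_transfer` are all hypothetical, and the network is replaced by a thread that delivers writes in shuffled order. The only idea taken from the abstract is that the receiver counts writes tagged with an immediate value and treats a transfer as complete once the expected count is reached, regardless of arrival order.

```python
import random
import threading
from collections import defaultdict


class ImmCounter:
    """Toy model: counts arrivals per immediate value (hypothetical API)."""

    def __init__(self):
        self._counts = defaultdict(int)
        self._cond = threading.Condition()

    def on_write_imm(self, imm: int) -> None:
        # Called once per completed write that carries immediate value `imm`.
        with self._cond:
            self._counts[imm] += 1
            self._cond.notify_all()

    def wait(self, imm: int, expected: int, timeout: float = 1.0) -> bool:
        # Block until `expected` writes tagged `imm` have landed, or time out.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._counts[imm] >= expected, timeout=timeout
            )


def simulate_transfer(num_chunks: int, imm: int, seed: int = 0) -> bool:
    """Deliver `num_chunks` writes in shuffled order and wait on the counter."""
    counter = ImmCounter()
    order = list(range(num_chunks))
    random.Random(seed).shuffle(order)  # assumption: models out-of-order delivery

    def deliver():
        for _chunk in order:
            counter.on_write_imm(imm)

    sender = threading.Thread(target=deliver)
    sender.start()
    done = counter.wait(imm, expected=num_chunks)
    sender.join()
    return done
```

Because the receiver only compares a count against an expected total, the sketch never inspects which chunk arrived first. That property is what lets a design of this kind avoid depending on in-order delivery from the transport.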