Revisiting Parameter Server in LLM Post-Training
January 27, 2026
Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Chaoyi Ruan, Min Lin, Jialin Li
cs.AI
Abstract
Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced workload assumption no longer holds in large language model (LLM) post-training due to the high variance in sequence lengths. Under imbalanced workloads, collective communication creates synchronization barriers, leading to under-utilization of devices with smaller workloads. This change in training dynamics calls for a revisit of the PS paradigm for its robustness to such imbalance. We propose On-Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36% speedup over standard FSDP. These results demonstrate that ODC is a superior fit for the prevalent imbalanced workloads in LLM post-training. Our implementation of ODC and its integration with FSDP are open-sourced at https://github.com/sail-sg/odc.
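To make the contrast concrete, the following is a minimal illustrative sketch, not the authors' implementation (see the linked repository for that), of the substitution the abstract describes: replacing a collective all-gather of parameter shards with direct point-to-point transfers. The toy shard shapes, function names, and launch setup are assumptions for demonstration only; the sketch uses standard torch.distributed primitives and can be launched with, for example, torchrun --nproc_per_node=4 odc_sketch.py.

# odc_sketch.py -- illustrative only, not the ODC implementation.
import torch
import torch.distributed as dist


def collective_gather(shard, world_size):
    # FSDP-style: every rank must enter this collective together,
    # so the slowest rank stalls all others (a per-layer barrier).
    out = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(out, shard)
    return torch.cat(out)


def on_demand_gather(shard, rank, world_size):
    # Point-to-point style: exchange shards via paired isend/irecv.
    # Each pairwise transfer only couples its two endpoints; there is
    # no single operation that all ranks must enter at the same time.
    recv_bufs = {p: torch.empty_like(shard) for p in range(world_size) if p != rank}
    ops = []
    for peer in range(world_size):
        if peer == rank:
            continue
        ops.append(dist.P2POp(dist.isend, shard, peer))
        ops.append(dist.P2POp(dist.irecv, recv_bufs[peer], peer))
    for work in dist.batch_isend_irecv(ops):
        work.wait()
    parts = [recv_bufs[p] if p != rank else shard for p in range(world_size)]
    return torch.cat(parts)


def main():
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Each rank owns one shard of a toy layer's flattened parameters.
    shard = torch.full((4,), float(rank))

    full_a = collective_gather(shard.clone(), world_size)
    full_b = on_demand_gather(shard.clone(), rank, world_size)
    assert torch.equal(full_a, full_b)
    if rank == 0:
        print("gathered parameter:", full_a.tolist())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Both functions reconstruct the same full parameter, but only the collective forces every rank in the group into the same operation; this is the property that turns per-layer communication into a synchronization barrier when per-device workloads differ, and that point-to-point transfers avoid.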