Revisiting Parameter Server in LLM Post-Training

Abstract

Modern data parallel (DP) training favors collective communication over parameter servers (PS) for its simplicity and efficiency under balanced workloads. However, the balanced-workload assumption no longer holds in large language model (LLM) post-training, where sequence lengths vary widely. Under imbalanced workloads, collective communication creates synchronization barriers that leave devices with smaller workloads under-utilized. This change in training dynamics calls for revisiting the PS paradigm, which is robust to such imbalance. We propose On-Demand Communication (ODC), which adapts PS into Fully Sharded Data Parallel (FSDP) by replacing collective all-gather and reduce-scatter with direct point-to-point communication. Compared to FSDP, ODC reduces the synchronization barrier from once per layer to once per minibatch and decouples the workload on each device so that faster workers are not stalled. It also enables simpler and more effective load balancing at the minibatch level. Across diverse LLM post-training tasks, ODC consistently improves device utilization and training throughput, achieving up to a 36% speedup over standard FSDP. These results demonstrate that ODC is a better fit for the imbalanced workloads prevalent in LLM post-training. Our implementation of ODC and its integration with FSDP are open-sourced at https://github.com/sail-sg/odc.
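
The effect of barrier granularity described above can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the ODC implementation: the function names and the `work` matrix are hypothetical, communication cost is ignored, every layer of a microbatch is assumed to take the same time, and each device is assumed to process the same number of microbatches per minibatch. With a barrier at every layer, each microbatch step costs as much as the slowest device in that step; with a single barrier per minibatch, only the per-device totals matter.

```python
# Toy model (an assumption for illustration, not the authors' code).
# work[d][m] = per-layer compute time of microbatch m on device d.

def per_layer_barrier_time(work, num_layers=1):
    """Minibatch time when all devices synchronize at every layer.

    Each microbatch step is bounded by the slowest device in that step,
    so the total is a sum of per-step maxima.
    """
    num_micro = len(work[0])
    per_step_max = (max(dev[m] for dev in work) for m in range(num_micro))
    return num_layers * sum(per_step_max)


def per_minibatch_barrier_time(work, num_layers=1):
    """Minibatch time when devices synchronize once, at the end.

    Each device runs through its own microbatches without waiting,
    so the total is the maximum of per-device sums.
    """
    return num_layers * max(sum(dev) for dev in work)


if __name__ == "__main__":
    # Two devices, three microbatches each, with uneven sequence lengths.
    work = [[1, 8, 2],
            [7, 1, 3]]
    print(per_layer_barrier_time(work))      # 7 + 8 + 3 = 18
    print(per_minibatch_barrier_time(work))  # max(11, 11) = 11
```

In this model the per-minibatch barrier is never slower, because a maximum of sums cannot exceed the corresponding sum of maxima; the gap grows with the variance of per-microbatch work, which matches the intuition in the abstract but says nothing about the measured speedups.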
