

veScale-FSDP: Flexible and High-Performance FSDP at Scale

February 25, 2026
Authors: Zezhou Wang, Youjie Li, Zhiqi Lin, Jiacheng Yang, Cong Xie, Guanyu Feng, Zheng Zhong, Ziyue Huang, Hongyu Zhu, Zhi Zhang, Yanghua Peng, Xin Liu
cs.AI

Abstract

Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, valued for its flexibility and minimal intrusion into model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2): FSDP's fixed element- or row-wise sharding formats conflict with block-structured computation patterns. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement FSDP requires, enabling block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5~66% higher throughput and 16~30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
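To make the core conflict concrete, the following is a minimal sketch (hypothetical, not veScale-FSDP's actual implementation, and RaggedShard's real layout is not described here): classic FSDP flattens a parameter and splits it evenly across ranks, which can cut a quantization block across shards, whereas a structure-aware sharding distributes whole blocks so per-block statistics stay local.

```python
# Illustrative sketch only: why flat element-wise FSDP sharding conflicts
# with block-wise structured methods such as block-quantized training.
# All names and sizes here are hypothetical.

WORLD_SIZE = 4   # number of data-parallel ranks
BLOCK = 4        # side length of a (BLOCK x BLOCK) quantization block
N = 8            # the weight is an N x N matrix

weight = [[float(i * N + j) for j in range(N)] for i in range(N)]

def elementwise_shards(w, world_size):
    """Classic FSDP/ZeRO style: flatten the parameter and split it
    evenly, ignoring any block structure."""
    flat = [x for row in w for x in row]
    per_rank = len(flat) // world_size
    return [flat[r * per_rank:(r + 1) * per_rank]
            for r in range(world_size)]

def blockwise_shards(w, world_size, block):
    """Structure-aware sharding: distribute whole (block x block) tiles,
    so each quantization block lives entirely on one rank."""
    tiles = []
    for i in range(0, N, block):
        for j in range(0, N, block):
            tiles.append([row[j:j + block] for row in w[i:i + block]])
    # Round-robin whole tiles across ranks; shard sizes may be ragged.
    return [tiles[r::world_size] for r in range(world_size)]

flat = elementwise_shards(weight, WORLD_SIZE)
tiled = blockwise_shards(weight, WORLD_SIZE, BLOCK)

# With tile-aligned shards, a per-block statistic (e.g. a quantization
# scale = max |value| in the block) needs no cross-rank communication:
scales = [[max(abs(x) for row in t for x in row) for t in shard]
          for shard in tiled]
```

In the flat scheme each rank's 16 contiguous elements span two full rows, i.e. fragments of two different 4x4 blocks, so computing a block scale would require a gather; in the tile-aligned scheme every rank owns complete blocks and computes its scales locally. The paper's non-element-wise optimizers (Shampoo, Muon) raise the analogous issue at the level of whole matrices rather than quantization blocks.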