veScale-FSDP: Flexible and High-Performance FSDP at Scale

Abstract

Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models and is valued for its flexibility and minimal intrusion into model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element-wise or row-wise sharding formats conflict with these block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, which limits scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement that FSDP requires, enabling block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5–66% higher throughput and 16–30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
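The conflict between fixed element-wise sharding and block-structured computation can be illustrated with a small sketch. This is not the veScale-FSDP API or its planning algorithm: the function names (`even_shards`, `block_aligned_shards`, `split_blocks`) and the block size of 128 are assumptions chosen purely for illustration. The sketch compares an even split of a flat parameter buffer with an uneven ("ragged") split whose boundaries fall on block edges, and counts how many blocks end up divided across ranks.

```python
# Illustrative sketch only: hypothetical helpers, not the veScale-FSDP API.

def even_shards(numel, world_size):
    """Element-wise sharding: equal contiguous ranges over a flat buffer."""
    per_rank = -(-numel // world_size)  # ceiling division
    return [
        (min(r * per_rank, numel), min((r + 1) * per_rank, numel))
        for r in range(world_size)
    ]

def block_aligned_shards(numel, world_size, block):
    """Ragged sharding: uneven ranges whose boundaries are multiples of `block`."""
    num_blocks = -(-numel // block)  # the last block may be partial
    base, extra = divmod(num_blocks, world_size)
    shards, start_block = [], 0
    for r in range(world_size):
        n = base + (1 if r < extra else 0)  # spread leftover blocks over first ranks
        start = min(start_block * block, numel)
        end = min((start_block + n) * block, numel)
        shards.append((start, end))
        start_block += n
    return shards

def split_blocks(shards, block):
    """Count blocks that straddle a boundary between two ranks."""
    numel = shards[-1][1]
    cuts = {end for _, end in shards[:-1] if 0 < end < numel and end % block != 0}
    return len(cuts)

even = even_shards(1000, 3)
ragged = block_aligned_shards(1000, 3, block=128)
print(even, split_blocks(even, 128))      # two blocks are cut across ranks
print(ragged, split_blocks(ragged, 128))  # no block is cut; shard sizes differ
```

With an even split, a rank that holds only part of a block cannot compute that block's quantization scale or a per-block optimizer update without extra communication. The ragged split avoids this at the cost of unequal shard sizes, which is the kind of trade-off a structure-aware planner would have to balance.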

