veScale-FSDP: Flexible and High-Performance FSDP at Scale

Abstract

Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models because of its flexibility and minimal intrusion into model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element-wise or row-wise sharding formats conflict with these block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, which limits scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement that FSDP requires, enabling block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
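The conflict between fixed element-wise sharding and block-structured computation can be illustrated with a small sketch. It is not veScale-FSDP's RaggedShard format or its planning algorithm, whose details are not given here; the function names (`even_shards`, `block_aligned_shards`, `split_blocks`) and the sizes are hypothetical, chosen only to show that an even split of a flattened tensor can cut through quantization blocks, while uneven shard boundaries placed on block multiples keep every block on one rank.

```python
def even_shards(numel, world_size):
    """Element-wise sharding: equal contiguous ranges of a flat tensor."""
    per_rank = -(-numel // world_size)  # ceiling division
    return [(min(r * per_rank, numel), min((r + 1) * per_rank, numel))
            for r in range(world_size)]


def block_aligned_shards(numel, world_size, block):
    """Uneven sharding (illustrative): every boundary is a multiple of `block`."""
    n_blocks = -(-numel // block)
    base, extra = divmod(n_blocks, world_size)
    bounds, start = [], 0
    for r in range(world_size):
        blocks_here = base + (1 if r < extra else 0)
        end = min(start + blocks_here * block, numel)
        bounds.append((start, end))
        start = end
    return bounds


def split_blocks(shards, block, numel):
    """Count blocks whose elements are spread over more than one rank."""
    cut = 0
    for lo in range(0, numel, block):
        hi = min(lo + block, numel)
        owners = {r for r, (s, e) in enumerate(shards) if s < hi and e > lo}
        cut += len(owners) > 1
    return cut


# Hypothetical sizes: 1024 elements, 128-element blocks, 3 ranks.
numel, block, world_size = 1024, 128, 3
print(split_blocks(even_shards(numel, world_size), block, numel))                  # 2
print(split_blocks(block_aligned_shards(numel, world_size, block), block, numel))  # 0
```

With the even split, two of the eight blocks straddle a rank boundary, so a block-wise operation on them would need extra communication or padding. The block-aligned split gives ranks 384, 384, and 256 elements, which is uneven but leaves every block intact.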
