veScale-FSDP: Flexible and High-Performance FSDP at Scale

Abstract

Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models and is valued for its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element-wise or row-wise sharding formats conflict with the block-structured computations these methods require. In addition, today's implementations fall short in communication and memory efficiency, which limits scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement that FSDP requires, enabling block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
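The conflict between fixed element-wise sharding and block-structured computation can be illustrated with a toy example. The sketch below is not veScale-FSDP's RaggedShard format or its planning algorithm; the function names (`even_shards`, `block_aligned_shards`, `straddled_blocks`) and the sizes are hypothetical and chosen only for illustration. It splits a flat parameter of 1024 elements across 3 ranks, once evenly by element count and once with uneven shards whose boundaries fall only on multiples of an assumed 128-element quantization block, then counts how many blocks end up split across ranks.

```python
def even_shards(numel, world_size):
    """Even element-wise split: every rank gets ceil(numel / world_size) elements,
    except that trailing ranks may get fewer."""
    per_rank = -(-numel // world_size)  # ceiling division
    return [(min(r * per_rank, numel), min((r + 1) * per_rank, numel))
            for r in range(world_size)]


def block_aligned_shards(numel, world_size, block):
    """Uneven ("ragged") split: shard boundaries fall only on multiples of `block`.
    Whole blocks are dealt out as evenly as possible across ranks."""
    num_blocks = -(-numel // block)
    base, extra = divmod(num_blocks, world_size)
    shards, start_block = [], 0
    for r in range(world_size):
        n = base + (1 if r < extra else 0)  # first `extra` ranks take one more block
        start = min(start_block * block, numel)
        end = min((start_block + n) * block, numel)
        shards.append((start, end))
        start_block += n
    return shards


def straddled_blocks(shards, block):
    """Number of distinct blocks that are cut by a boundary between two shards."""
    numel = shards[-1][1]
    cut = {end // block for (_, end) in shards[:-1]
           if 0 < end < numel and end % block != 0}
    return len(cut)


if __name__ == "__main__":
    numel, world_size, block = 1024, 3, 128  # illustrative sizes only
    even = even_shards(numel, world_size)
    ragged = block_aligned_shards(numel, world_size, block)
    print("even shards:  ", even, "blocks split:", straddled_blocks(even, block))
    print("ragged shards:", ragged, "blocks split:", straddled_blocks(ragged, block))
```

With these sizes, the even split cuts two blocks across rank boundaries, so a block-wise operation on those blocks would need data from two ranks, while the block-aligned split cuts none at the cost of unequal shard sizes (384, 384, and 256 elements).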
