veScale-FSDP: Flexible and High-Performance FSDP at Scale

Abstract

Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models because of its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with non-element-wise optimizers (e.g., Shampoo and Muon) used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element-wise or row-wise sharding formats conflict with these block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, which hinders scaling to tens of thousands of GPUs. We introduce veScale-FSDP, a redesigned FSDP system that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. veScale-FSDP natively supports the efficient data placement required by FSDP, enabling block-wise quantization and non-element-wise optimizers. As a result, veScale-FSDP achieves 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
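To make the conflict between fixed sharding and block-structured computation concrete, the following minimal sketch compares an even element-wise split of a flat parameter buffer with a split that hands out whole blocks. It is an illustration under stated assumptions, not veScale-FSDP's RaggedShard format or planning algorithm: the function names, the 128-element block size, and the 3-rank setup are all hypothetical.

```python
from typing import List, Tuple

Shard = Tuple[int, int]  # half-open [start, end) range into a flat buffer


def even_shards(numel: int, world_size: int) -> List[Shard]:
    """Element-wise sharding: equal-sized ranges, ignoring block structure."""
    per_rank = -(-numel // world_size)  # ceiling division
    return [
        (min(r * per_rank, numel), min((r + 1) * per_rank, numel))
        for r in range(world_size)
    ]


def block_aligned_shards(numel: int, world_size: int, block: int) -> List[Shard]:
    """Hypothetical ragged sharding: each rank gets a whole number of blocks,
    so shard sizes may differ across ranks."""
    num_blocks = -(-numel // block)
    base, extra = divmod(num_blocks, world_size)
    shards: List[Shard] = []
    start_block = 0
    for r in range(world_size):
        n = base + (1 if r < extra else 0)  # first `extra` ranks take one more block
        start = min(start_block * block, numel)
        end = min((start_block + n) * block, numel)
        shards.append((start, end))
        start_block += n
    return shards


def misaligned_boundaries(shards: List[Shard], numel: int, block: int) -> int:
    """Count interior shard boundaries that cut through a block."""
    boundaries = {end for _, end in shards if 0 < end < numel}
    return sum(1 for b in boundaries if b % block != 0)


if __name__ == "__main__":
    numel, world_size, block = 1024, 3, 128  # assumed sizes for illustration
    even = even_shards(numel, world_size)
    ragged = block_aligned_shards(numel, world_size, block)
    print("even  :", even, "misaligned:", misaligned_boundaries(even, numel, block))
    print("ragged:", ragged, "misaligned:", misaligned_boundaries(ragged, numel, block))
```

With these assumed sizes, the even split places both interior boundaries inside a 128-element block, so a per-block operation such as computing a quantization scale would need data from two ranks. The block-aligned split keeps every block on a single rank at the cost of unequal shard sizes (384, 384, and 256 elements).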
