

Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU

March 11, 2024
Authors: Changyue Liao, Mo Sun, Zihan Yang, Kaiqi Chen, Binhang Yuan, Fei Wu, Zeke Wang
cs.AI

Abstract

Recent advances in large language models have brought immense value to the world, with their superior capabilities stemming from the massive number of parameters they utilize. However, even the GPUs with the highest memory capacity, currently peaking at 80GB, are far from sufficient to accommodate these vast parameters and their associated optimizer states when conducting stochastic gradient descent-based optimization. One approach to hosting such huge models is to aggregate device memory from many GPUs. However, this approach is prohibitively expensive for most academic researchers, who typically have limited budgets for high-end GPU servers. In this paper, we focus on fine-tuning huge models on a single, even low-end, GPU in a commodity server, a setting accessible to most AI researchers. In this scenario, the state-of-the-art framework ZeRO-Infinity suffers from two severe issues when running on a commodity server: 1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for high-end GPU servers. To this end, we present Fuyou, a low-cost training framework that enables efficient fine-tuning of 100B-scale models on a low-end server with a low-end GPU and limited CPU memory. The key idea is to treat SSD-CPU communication as an optimization dimension and to carefully co-optimize computation and data swapping in a systematic way so as to maximize GPU utilization. Experimental results show that 1) Fuyou fine-tunes 175B GPT-3 on a consumer RTX 4090 GPU with high GPU utilization, whereas ZeRO-Infinity fails to fine-tune it at all; and 2) when training a smaller GPT-3 13B model, Fuyou achieves 156 TFLOPS on an RTX 4090 GPU, while ZeRO-Infinity achieves only 45 TFLOPS.
