Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU

Abstract

Recent advances in large language models have brought immense value to the world, with their superior capabilities stemming from the massive number of parameters they use. However, even the GPUs with the highest memory capacity, currently 80 GB, are far from sufficient to hold these parameters and their associated optimizer states during stochastic gradient descent-based optimization. One approach to hosting such huge models is to aggregate device memory from many GPUs. However, this approach is prohibitively expensive for most academic researchers, whose budgets for high-end GPU servers are always limited. In this paper, we focus on fine-tuning huge models on a single, even low-end, GPU in a commodity server, a setup accessible to most AI researchers. In this scenario, the state-of-the-art work ZeRO-Infinity suffers from two severe issues: 1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for high-end GPU servers. To this end, we present Fuyou, a low-cost training framework that enables efficient fine-tuning of 100B-scale models on a low-end server with a low-end GPU and limited CPU memory capacity. The key idea is to add SSD-CPU communication as an optimization dimension and thus carefully co-optimize computation and data swapping in a systematic way to maximize GPU utilization. The experimental results show that 1) Fuyou can fine-tune GPT-3 175B on a consumer RTX 4090 GPU with high GPU utilization, while ZeRO-Infinity fails to fine-tune it; and 2) when training a small GPT-3 13B model, Fuyou achieves 156 TFLOPS on an RTX 4090 GPU, while ZeRO-Infinity achieves only 45 TFLOPS.
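To make the memory gap concrete, the following is a rough back-of-the-envelope sketch, not a calculation taken from the paper. It assumes the commonly used mixed-precision Adam accounting of 16 bytes per parameter (fp16 parameters and gradients, plus fp32 master parameters, momentum, and variance) and ignores activations and temporary buffers. The function and constant names are hypothetical, and the 24 GB figure is the RTX 4090's device memory.

```python
# Rough model-state footprint under an ASSUMED mixed-precision Adam layout.
# This accounting is an illustration; the abstract does not specify it.
BYTES_PER_PARAM = {
    "fp16_params": 2,
    "fp16_grads": 2,
    "fp32_master_params": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}  # sums to 16 bytes per parameter


def model_state_bytes(num_params: int) -> int:
    """Bytes needed for parameters, gradients, and optimizer states."""
    return num_params * sum(BYTES_PER_PARAM.values())


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


if __name__ == "__main__":
    gpu_memory_gb = {"80 GB GPU": 80, "RTX 4090": 24}
    for name, num_params in [("GPT-3 13B", 13 * 10**9), ("GPT-3 175B", 175 * 10**9)]:
        need_gb = to_gb(model_state_bytes(num_params))
        print(f"{name}: about {need_gb:.0f} GB of model states")
        for gpu, capacity_gb in gpu_memory_gb.items():
            print(f"  {need_gb / capacity_gb:.1f}x the memory of one {gpu}")
```

Under these assumptions even the 13B model's states exceed a single GPU's memory several times over, which is why the states have to live in CPU memory or on SSD and be swapped in as needed.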
