AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning

Abstract

Large Language Models (LLMs) have demonstrated impressive performance across various downstream tasks. When training these models, there is a growing tendency to process more tokens at larger training scales while keeping model sizes relatively small. The Zero Redundancy Optimizer (ZeRO), although effective in conventional training environments, runs into scaling challenges under this emerging paradigm. To address this, we propose AMSP, a novel LLM training framework that partitions model states at a finer granularity, treating parameters (P), gradients (G), and optimizer states (OS) separately. Specifically, AMSP (1) builds a unified partitioning space that enables independent partitioning strategies for P, G, and OS; (2) incorporates a scale-aware partitioner that automatically searches for optimal partitioning strategies; and (3) designs a dedicated communication optimizer that efficiently manages the data placement discrepancies arising from diverse partitioning strategies. Our evaluations show that AMSP achieves up to 90.3% scaling efficiency across 1024 GPUs.
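To make the idea of independent partitioning for P, G, and OS concrete, the following is a minimal sketch and not AMSP's actual partitioner. It assumes the mixed-precision Adam memory accounting from the ZeRO work (2 bytes per parameter for fp16 parameters, 2 for fp16 gradients, 12 for fp32 optimizer states). The names `Strategy`, `memory_per_gpu`, `feasible_strategies`, and `least_sharded` are hypothetical, and the selection rule is a simple stand-in for a real communication cost model, which the abstract does not describe.

```python
from dataclasses import dataclass
from itertools import product

# Assumed bytes per model parameter under mixed-precision Adam
# (the accounting used in the ZeRO work; not stated in this abstract).
BYTES_P = 2    # fp16 parameters
BYTES_G = 2    # fp16 gradients
BYTES_OS = 12  # fp32 master weights + momentum + variance


@dataclass(frozen=True)
class Strategy:
    """Hypothetical point in the partitioning space: one shard degree per state."""
    p_shards: int
    g_shards: int
    os_shards: int


def memory_per_gpu(num_params: int, s: Strategy) -> float:
    """Bytes of model state held by each GPU under strategy s."""
    return num_params * (
        BYTES_P / s.p_shards + BYTES_G / s.g_shards + BYTES_OS / s.os_shards
    )


def divisors(n: int) -> list[int]:
    """Candidate shard degrees: sizes that divide the device count evenly."""
    return [d for d in range(1, n + 1) if n % d == 0]


def feasible_strategies(num_params: int, world_size: int, budget_bytes: float) -> list[Strategy]:
    """Enumerate every (P, G, OS) shard-degree triple that fits the memory budget."""
    degrees = divisors(world_size)
    candidates = (Strategy(p, g, o) for p, g, o in product(degrees, repeat=3))
    return [s for s in candidates if memory_per_gpu(num_params, s) <= budget_bytes]


def least_sharded(strategies: list[Strategy]) -> Strategy:
    """Stand-in selection rule (an assumption, not AMSP's cost model):
    keep parameters as replicated as possible, then gradients, then optimizer states."""
    return min(strategies, key=lambda s: (s.p_shards, s.g_shards, s.os_shards))


if __name__ == "__main__":
    # Example: a 1B-parameter model on 8 GPUs with a 6 GB model-state budget per GPU.
    options = feasible_strategies(10**9, 8, 6e9)
    best = least_sharded(options)
    print(best, memory_per_gpu(10**9, best) / 1e9, "GB per GPU")
```

In this toy setting, full replication and ZeRO-3-style sharding of all three states are just two points in the same space; letting the three shard degrees vary independently exposes intermediate options that trade memory against communication.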


