PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Abstract

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption, because the number of in-flight microbatches grows with the degree of PP. In this paper, we address this challenge by leveraging the under-explored memory offload strategy in PP. Through an empirical study, we find that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limits. Our experiments show that per-device activation memory decreases effectively with the total number of stages, making PP a stronger alternative to tensor parallelism (TP) and offering up to 19% acceleration with even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
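
The scaling problem described in the abstract can be illustrated with a small arithmetic sketch. This is a toy model, not the paper's implementation: it assumes a standard 1F1B schedule, in which stage i of p stages holds at most p - i in-flight microbatches, and it assumes layers are split evenly across stages. The function names and the unit of measurement (one layer's activations for one microbatch) are hypothetical and chosen only for illustration.

```python
def peak_inflight_microbatches(stage: int, num_stages: int) -> int:
    """Peak number of microbatches whose activations a stage keeps alive.

    Assumes a 1F1B schedule: stage i runs (p - i - 1) warm-up forward
    passes, then alternates one forward and one backward pass.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")
    return num_stages - stage


def peak_activation_units(stage: int, num_stages: int, num_layers: int) -> int:
    """Toy estimate of peak activation memory on one stage, without offload.

    One unit is the activation footprint of one layer for one microbatch
    (an illustrative unit, not a measured quantity).
    """
    if num_layers % num_stages != 0:
        raise ValueError("this sketch assumes layers divide evenly across stages")
    layers_per_stage = num_layers // num_stages
    return layers_per_stage * peak_inflight_microbatches(stage, num_stages)


if __name__ == "__main__":
    # The first stage is the worst case: fewer layers per device are
    # cancelled out by more in-flight microbatches.
    for p in (2, 4, 8):
        print(p, peak_activation_units(0, p, num_layers=32))
```

Under these assumptions, the first stage's peak stays at 32 units for every value of p: adding stages reduces the layers per device but increases the in-flight microbatches by the same factor. That is the bottleneck that offloading activations is meant to relieve.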
