EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
December 8, 2023
Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
cs.AI
Abstract
We present EE-LLM, a framework for large-scale training and inference of
early-exit large language models (LLMs). While recent works have shown
preliminary evidence for the efficacy of early exiting in accelerating LLM
inference, EE-LLM makes a foundational step towards scaling up early-exit LLMs
by supporting their training and inference with massive 3D parallelism. Built
upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and
performance optimizations tailored to early exiting, including a lightweight
method that facilitates backpropagation for the early-exit training objective
with pipeline parallelism, techniques for leveraging idle resources in the
original pipeline schedule for computation related to early-exit layers, and
two approaches to early-exit inference that are compatible with KV caching for
autoregressive generation. Our analytical and empirical study shows that EE-LLM
achieves great training efficiency with negligible computational overhead
compared to standard LLM training, as well as outstanding inference speedup
without compromising output quality. To facilitate further research and
adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.
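
To make the idea of early exiting concrete, here is a minimal sketch in plain Python. It is not taken from EE-LLM or Megatron-LM. The function `early_exit_forward`, the confidence-threshold exit rule, and the toy layers and heads are all assumptions chosen for illustration, and the sketch ignores parallelism and KV caching entirely.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_forward(
    hidden: Vector,
    layers: Sequence[Layer],
    exit_heads: Dict[int, Layer],
    final_head: Layer,
    threshold: float,
) -> Tuple[int, int]:
    """Run layers in order and stop at the first exit head that is confident.

    exit_heads maps a depth (1-based layer count) to a hypothetical output
    head. Returns (predicted_token_id, number_of_layers_executed).
    """
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        head = exit_heads.get(depth)
        if head is not None:
            probs = softmax(head(hidden))
            confidence = max(probs)
            # Assumed exit rule: stop once the top probability is high enough.
            if confidence >= threshold:
                return probs.index(confidence), depth
    # No early exit fired, so fall back to the final output head.
    probs = softmax(final_head(hidden))
    return probs.index(max(probs)), len(layers)


# Toy model: each "layer" doubles the hidden state, and the heads use the
# hidden state directly as logits.
toy_layers: List[Layer] = [lambda h: [2.0 * x for x in h] for _ in range(4)]
identity_head: Layer = lambda h: list(h)

easy = early_exit_forward([1.0, 0.0, 0.0], toy_layers, {2: identity_head},
                          identity_head, threshold=0.9)
strict = early_exit_forward([1.0, 0.0, 0.0], toy_layers, {2: identity_head},
                            identity_head, threshold=0.99)
print(easy, strict)
```

With the looser threshold the toy model stops after two of its four layers, while the stricter threshold forces a full forward pass. That trade-off between compute and confidence is the reason early exiting can speed up inference.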