EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

Abstract

We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs). While recent works have shown preliminary evidence for the efficacy of early exiting in accelerating LLM inference, EE-LLM makes a foundational step towards scaling up early-exit LLMs by supporting their training and inference with massive 3D parallelism. Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting. These include a lightweight method that facilitates backpropagation for the early-exit training objective under pipeline parallelism, techniques that leverage idle resources in the original pipeline schedule for computation related to early-exit layers, and two approaches to early-exit inference that are compatible with KV caching for autoregressive generation. Our analytical and empirical study shows that EE-LLM achieves high training efficiency with negligible computational overhead compared to standard LLM training, as well as substantial inference speedup without compromising output quality. To facilitate further research and adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.
