Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)
September 16, 2023
Authors: Parsa Kavehzadeh, Mojtaba Valipour, Marzieh Tahaei, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh
cs.AI
Abstract
The rapid advancement of large language models (LLMs) has revolutionized
natural language processing (NLP). While these models excel at understanding
and generating human-like text, their widespread deployment can be
prohibitively expensive. SortedNet is a recent training technique for enabling
dynamic inference for deep neural networks. It leverages network modularity to
create sub-models with varying computational loads, sorting them based on
computation/accuracy characteristics in a nested manner. We extend SortedNet to
generative NLP tasks, making large language models dynamic without any
pretraining, simply by replacing standard Supervised Fine-Tuning (SFT) with
Sorted Fine-Tuning (SoFT) at the same cost. Our approach boosts model
efficiency, eliminating the need for multiple models for various scenarios
during inference. We show that using this approach, we are able to unlock the
potential of intermediate layers of transformers in generating the target
output. Our sub-models remain integral components of the original model,
minimizing storage requirements and transition costs between different
computational/latency budgets. By applying this approach to LLaMa 2 13B,
tuning it on the Stanford Alpaca dataset, and comparing it to standard
fine-tuning and early exit via the PandaLM benchmark, we show that Sorted
Fine-Tuning can deliver
models twice as fast as the original model while maintaining or exceeding
performance.
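
The abstract describes the core mechanism only in prose: sub-models are obtained by truncating the decoder at several depths, every truncated stack reuses the shared final norm and LM head, and the fine-tuning loss is aggregated over all sub-models in a single forward pass. The sketch below is a minimal, hypothetical PyTorch rendering of that idea on a toy decoder; `ToyDecoderLM`, `EXIT_LAYERS`, `soft_loss`, and the plain loss averaging are illustrative assumptions, not the authors' released implementation or the actual LLaMa 2 architecture.

```python
# Minimal SoFT-style training sketch (PyTorch). Hypothetical toy model,
# not the paper's code: ToyDecoderLM stands in for a decoder-only LLM,
# EXIT_LAYERS marks the nested sub-model depths, and soft_loss aggregates
# the language-modeling loss of every sub-model in one forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoderLM(nn.Module):
    def __init__(self, vocab=32000, dim=256, n_layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.final_norm = nn.LayerNorm(dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)

EXIT_LAYERS = [2, 4, 6, 8]  # toy-scale analogue of exiting every few blocks

def soft_loss(model, input_ids, labels):
    """Aggregate next-token loss over all nested sub-models."""
    seq_len = input_ids.size(1)
    causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    h = model.embed(input_ids)
    losses = []
    for depth, block in enumerate(model.layers, start=1):
        h = block(h, src_mask=causal)
        if depth in EXIT_LAYERS:
            # Every sub-model reuses the shared final norm and LM head,
            # so the smaller models add no parameters of their own.
            logits = model.lm_head(model.final_norm(h))
            losses.append(F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1)))
    return torch.stack(losses).mean()  # simple average; other weightings possible

# Usage: swap the usual SFT loss for soft_loss in the fine-tuning loop.
model = ToyDecoderLM()
input_ids = torch.randint(0, 32000, (2, 16))
loss = soft_loss(model, input_ids, input_ids.clone())
loss.backward()
```

At inference time, a latency budget can then be met by running only the first k layers plus the shared head, which is why the sub-models remain part of the original checkpoint and add essentially no storage or switching cost.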