When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
February 27, 2024
Authors: Biao Zhang, Zhongtao Liu, Colin Cherry, Orhan Firat
cs.AI
Abstract
While large language models (LLMs) often adopt finetuning to unlock their capabilities for downstream applications, our understanding of the inductive biases (especially the scaling properties) of different finetuning methods is still limited. To fill this gap, we conduct systematic experiments studying whether and how different scaling factors, including LLM model size, pretraining data size, new finetuning parameter size, and finetuning data size, affect finetuning performance. We consider two types of finetuning -- full-model tuning (FMT) and parameter-efficient tuning (PET, including prompt tuning and LoRA) -- and explore their scaling behaviors in the data-limited regime where the LLM model size substantially outweighs the finetuning data size. Based on two sets of pretrained bilingual LLMs from 1B to 16B and experiments on bilingual machine translation and multilingual summarization benchmarks, we find that 1) LLM finetuning follows a power-based multiplicative joint scaling law between finetuning data size and each of the other scaling factors; 2) LLM finetuning benefits more from LLM model scaling than from pretraining data scaling, and PET parameter scaling is generally ineffective; and 3) the optimal finetuning method is highly task- and finetuning-data-dependent. We hope our findings shed light on understanding, selecting, and developing LLM finetuning methods.
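For readers wanting a concrete picture of the "power-based multiplicative joint scaling law" mentioned above, a minimal sketch of the usual parameterization of such a law follows; the notation and constants here are illustrative assumptions, not reproduced from the paper. Let X be one of the other scaling factors (LLM model size, pretraining data size, or PET parameter size) and D_f the finetuning data size; a multiplicative joint power law for the held-out loss then takes a form like

\hat{\mathcal{L}}(X, D_f) = A \cdot X^{-\alpha} \cdot D_f^{-\beta} + E

where A, E, \alpha, and \beta would be fitted separately per task and per finetuning method. "Multiplicative" refers to the product of the two power-law terms: the gain from adding finetuning data is modulated by the other factor, rather than entering as an independent additive term.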