Scaling Laws for Downstream Task Performance of Large Language Models
February 6, 2024
Authors: Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo
cs.AI
Abstract
Scaling laws provide important insights that can guide the design of large
language models (LLMs). Existing work has primarily focused on studying scaling
laws for pretraining (upstream) loss. However, in transfer learning settings,
in which LLMs are pretrained on an unsupervised dataset and then finetuned on a
downstream task, we often also care about the downstream performance. In this
work, we study the scaling behavior in a transfer learning setting, where LLMs
are finetuned for machine translation tasks. Specifically, we investigate how
the choice of the pretraining data and its size affect downstream performance
(translation quality) as judged by two metrics: downstream cross-entropy and
BLEU score. Our experiments indicate that the size of the finetuning dataset
and the distribution alignment between the pretraining and downstream data
significantly influence the scaling behavior. With sufficient alignment, both
downstream cross-entropy and BLEU score improve monotonically with more
pretraining data. In such cases, we show that it is possible to predict the
downstream BLEU score with good accuracy using a log-law. However, there are
also cases where moderate misalignment causes the BLEU score to fluctuate or
get worse with more pretraining, whereas downstream cross-entropy monotonically
improves. By analyzing these observations, we provide new practical insights
for choosing appropriate pretraining data.
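
To make the log-law prediction concrete, here is a minimal sketch of how such a fit could be carried out. The abstract does not give the exact formula, so the form used here, BLEU ≈ (log(A · D^α))^β with D the pretraining data size, is an assumption, as are the names `fit_log_law` and `predict_bleu`, the grid over β, and the data points, which are synthetic values generated from the assumed law and not results from the paper.

```python
import math


def predict_bleu(size, log_a, alpha, beta):
    """Assumed log-law: (log(A * size**alpha))**beta, with log_a = log(A)."""
    return (log_a + alpha * math.log(size)) ** beta


def fit_log_law(sizes, bleu_scores, betas=None):
    """Fit the assumed log-law to (pretraining size, BLEU) pairs.

    For a fixed beta, bleu**(1/beta) is linear in log(size), so log_a and
    alpha come from ordinary least squares. beta is chosen by grid search,
    scoring each candidate by squared error in the original BLEU space.
    """
    if betas is None:
        betas = [k / 20 for k in range(1, 101)]  # hypothetical grid: 0.05 .. 5.0
    xs = [math.log(d) for d in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for beta in betas:
        ys = [b ** (1.0 / beta) for b in bleu_scores]
        mean_y = sum(ys) / n
        alpha = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
        log_a = mean_y - alpha * mean_x
        inner = [log_a + alpha * x for x in xs]
        if min(inner) <= 0:
            continue  # the law is only defined for a positive logarithm term
        err = sum((v ** beta - b) ** 2 for v, b in zip(inner, bleu_scores))
        if best is None or err < best[0]:
            best = (err, log_a, alpha, beta)
    if best is None:
        raise ValueError("no valid fit found on the beta grid")
    return best[1], best[2], best[3]


# Synthetic illustration only: points are generated from the assumed law.
TRUE_LOG_A, TRUE_ALPHA, TRUE_BETA = -2.0, 0.3, 2.0
sizes = [1e6, 1e7, 1e8, 1e9, 1e10]  # hypothetical pretraining token counts
synthetic_bleu = [predict_bleu(d, TRUE_LOG_A, TRUE_ALPHA, TRUE_BETA) for d in sizes]

# Fit on the four smallest sizes, then extrapolate to the largest one.
log_a, alpha, beta = fit_log_law(sizes[:4], synthetic_bleu[:4])
predicted = predict_bleu(sizes[4], log_a, alpha, beta)
print(f"alpha={alpha:.3f} beta={beta:.2f} predicted BLEU at 1e10: {predicted:.2f}")
```

Holding out the largest size mimics the intended use: fit on cheaper, smaller pretraining runs and check the extrapolation against a larger one. As the abstract notes, this kind of fit is only expected to work when BLEU improves monotonically, that is, when the pretraining and downstream data are sufficiently aligned.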