How Do Large Language Models Acquire Factual Knowledge During Pretraining?
June 17, 2024
Authors: Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo
cs.AI
Abstract
Despite the recent observation that large language models (LLMs) can store
substantial factual knowledge, there is a limited understanding of the
mechanisms of how they acquire factual knowledge through pretraining. This work
addresses this gap by studying how LLMs acquire factual knowledge during
pretraining. The findings reveal several important insights into the dynamics
of factual knowledge acquisition during pretraining. First, counterintuitively,
we observe that pretraining on more data shows no significant improvement in
the model's capability to acquire and maintain factual knowledge. Next, there
is a power-law relationship between training steps and forgetting of
memorization and generalization of factual knowledge, and LLMs trained with
duplicated training data exhibit faster forgetting. Third, training LLMs with
larger batch sizes can enhance the models' robustness to forgetting. Overall,
our observations suggest that factual knowledge acquisition in LLM pretraining
occurs by progressively increasing the probability of factual knowledge
presented in the pretraining data at each step. However, this increase is
diluted by subsequent forgetting. Based on this interpretation, we demonstrate
that we can provide plausible explanations for recently observed behaviors of
LLMs, such as the poor performance of LLMs on long-tail knowledge and the
benefits of deduplicating the pretraining corpus.
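The power-law relationship between training steps and forgetting can be made concrete with a small sketch. This is an illustrative assumption, not the paper's actual measurement procedure: we model retention of an acquired fact as decaying like t^(-alpha) in the number of steps since the fact was last seen, and recover the decay exponent with a log-log linear fit. The function name `retention` and all parameter values are hypothetical.

```python
import numpy as np

def retention(steps, a=1.0, alpha=0.4):
    """Hypothetical power-law forgetting curve: retention(t) = a * t^(-alpha).

    `a` and `alpha` are illustrative constants, not values from the paper.
    """
    return a * np.power(steps, -alpha)

# Synthetic "observations" of how much of an acquired fact's probability
# boost survives after t further training steps.
steps = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
observed = retention(steps, a=0.9, alpha=0.35)

# A power law is a straight line in log-log space, so the decay exponent
# is the (negated) slope of a linear fit on the logs.
slope, intercept = np.polyfit(np.log(steps), np.log(observed), 1)
print(f"estimated decay exponent alpha ~= {-slope:.2f}")
```

Under this reading, faster forgetting with duplicated data corresponds to a larger decay exponent, and larger batch sizes to a smaller one.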