How Do Large Language Models Acquire Factual Knowledge During Pretraining?
June 17, 2024
Authors: Hoyeon Chang, Jinho Park, Seonghyeon Ye, Sohee Yang, Youngkyung Seo, Du-Seong Chang, Minjoon Seo
cs.AI
Abstract
Despite the recent observation that large language models (LLMs) can store
substantial factual knowledge, the mechanisms by which they acquire that
knowledge through pretraining remain poorly understood. This work
addresses this gap by studying how LLMs acquire factual knowledge during
pretraining. The findings reveal several important insights into the dynamics
of factual knowledge acquisition during pretraining. First, counterintuitively,
we observe that pretraining on more data shows no significant improvement in
the model's capability to acquire and maintain factual knowledge. Next, there
is a power-law relationship between training steps and forgetting of
memorization and generalization of factual knowledge, and LLMs trained with
duplicated training data exhibit faster forgetting. Third, training LLMs with
larger batch sizes can enhance the models' robustness to forgetting. Overall,
our observations suggest that factual knowledge acquisition in LLM pretraining
occurs through step-wise increases in the probability the model assigns to the
factual knowledge presented in the pretraining data. However, each increase is
diluted by subsequent forgetting. Based on this interpretation, we demonstrate
that we can provide plausible explanations for recently observed behaviors of
LLMs, such as the poor performance of LLMs on long-tail knowledge and the
benefits of deduplicating the pretraining corpus.
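To make the acquisition-then-forgetting account concrete, here is a minimal sketch (not the paper's code) that simulates an immediate probability gain when a fact is presented, decaying under a power law over subsequent training steps. The gain, the decay exponents, and all names below are hypothetical choices for illustration only.

```python
# Minimal sketch (hypothetical constants, not from the paper): the abstract's
# account of acquisition as an immediate probability gain on presentation,
# followed by power-law forgetting over subsequent training steps.
import numpy as np

def retained_gain(steps, gain, alpha):
    """Power-law forgetting: portion of an acquired gain left after `steps` steps."""
    return gain * (steps + 1.0) ** (-alpha)

gain = 0.30          # hypothetical immediate probability improvement on presentation
alpha_dedup = 0.05   # hypothetical decay exponent on deduplicated data
alpha_dup = 0.15     # duplicated data -> faster forgetting (larger exponent)

steps = np.arange(0, 10_001, 2_000)
print("steps  dedup  duplicated")
for t, u, d in zip(steps,
                   retained_gain(steps, gain, alpha_dedup),
                   retained_gain(steps, gain, alpha_dup)):
    print(f"{t:>5d}  {u:.3f}  {d:.3f}")

# Under this toy model, long-tail facts appear so rarely that each gain
# decays away before the next presentation, consistent with the abstract's
# explanation of poor long-tail performance.
```

Running the sketch shows the duplicated-data curve (larger exponent) shedding its gain markedly faster, which is one way to read the paper's observed link between corpus duplication and accelerated forgetting.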