

Understanding In-Context Learning via Supportive Pretraining Data

June 26, 2023
Authors: Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, Tianlu Wang
cs.AI

Abstract

In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time. It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations. Unlike prior work that explores implicit mechanisms behind ICL, we study ICL by investigating the pretraining data. Specifically, we first adapt an iterative, gradient-based approach to find a small subset of pretraining data that supports ICL. We observe that continued pretraining on this small subset significantly improves the model's ICL ability, by up to 18%. We then compare the supportive subset contrastively with random subsets of pretraining data and discover: (1) The supportive pretraining data for ICL do not have a higher domain relevance to downstream tasks. (2) The supportive pretraining data have a higher mass of rarely occurring, long-tail tokens. (3) The supportive pretraining data are challenging examples where the information gain from long-range context is below average, indicating that learning to incorporate difficult long-range context encourages ICL. Our work takes a first step towards understanding ICL via analyzing instance-level pretraining data. Our insights have the potential to enhance the ICL ability of language models by actively guiding the construction of pretraining data in the future.
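
To make the selection step more concrete, below is a minimal sketch of one way an iterative, gradient-based search for supportive pretraining data could look: each pretraining instance is scored by how well its loss gradient aligns with the gradient of the model's few-shot ICL loss, and the top-scoring instances are kept across iterations. The helper names (`flat_grad`, `select_supportive_subset`), the cosine-similarity scoring, the fixed iteration count, and the user-supplied loss functions are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: rank pretraining instances by gradient alignment with a few-shot ICL loss.
# Assumes the caller supplies `lm_loss_fn(model, instance)` and
# `icl_loss_fn(model, demonstrations)` that return scalar losses with gradients
# attached to the model's parameters. These callables are hypothetical.
import torch


def flat_grad(model: torch.nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Gradient of `loss` w.r.t. all trainable parameters, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])


def select_supportive_subset(model, lm_loss_fn, icl_loss_fn,
                             pretraining_instances, icl_demonstrations,
                             subset_size: int, num_iterations: int = 3):
    """Iteratively keep the pretraining instances whose gradients align with the ICL gradient."""
    candidates = list(pretraining_instances)
    for _ in range(num_iterations):
        # Gradient of the few-shot ICL loss: the "target" direction for supportiveness.
        icl_grad = flat_grad(model, icl_loss_fn(model, icl_demonstrations))
        scores = []
        for instance in candidates:
            inst_grad = flat_grad(model, lm_loss_fn(model, instance))
            scores.append(torch.nn.functional.cosine_similarity(
                icl_grad.unsqueeze(0), inst_grad.unsqueeze(0)).item())
        # Retain the instances most aligned with the ICL gradient.
        ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
        candidates = [inst for _, inst in ranked[:subset_size]]
        # In a full procedure the model would be briefly continued-pretrained on
        # `candidates` here before re-scoring; that step is omitted in this sketch.
    return candidates
```

A selected subset from such a procedure could then be used for continued pretraining, and its effect on few-shot ICL accuracy compared against continued pretraining on an equally sized random subset, mirroring the contrastive analysis described in the abstract.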