Understanding In-Context Learning via Supportive Pretraining Data
June 26, 2023
Authors: Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, Tianlu Wang
cs.AI
Abstract
In-context learning (ICL) improves language models' performance on a variety
of NLP tasks by simply demonstrating a handful of examples at inference time.
It is not well understood why ICL ability emerges, as the model has never been
specifically trained on such demonstrations. Unlike prior work that explores
implicit mechanisms behind ICL, we study ICL by investigating the pretraining
data. Specifically, we first adapt an iterative, gradient-based approach to
find a small subset of pretraining data that supports ICL. We observe that
continued pretraining on this small subset significantly improves the model's
ICL ability by up to 18%. We then compare the supportive subset contrastively
with random subsets of pretraining data and discover: (1) The supportive
pretraining data for ICL do not have higher domain relevance to downstream
tasks. (2) The supportive pretraining data have a higher mass of rarely
occurring, long-tail tokens. (3) The supportive pretraining data are
challenging examples where the information gain from long-range context is
below average, indicating that learning to incorporate difficult long-range
context encourages ICL. Our work takes a first step towards understanding ICL by
analyzing instance-level pretraining data. Our insights have the potential to
enhance the ICL ability of language models by actively guiding the construction
of pretraining data in the future.
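
The abstract describes the selection procedure only at a high level ("an iterative, gradient-based approach"). As a rough, hypothetical sketch of one gradient-based selection heuristic in this spirit (not the paper's exact method), the example below scores each candidate pretraining example by the cosine similarity between its loss gradient and the gradient of the model's loss on an ICL-formatted batch, then keeps the top-scoring candidates; the toy model, synthetic data, and function names are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flat_grad(loss, params):
    # Flatten the gradient of `loss` w.r.t. `params` into a single vector.
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def supportiveness(model, pretrain_example, icl_batch, loss_fn):
    # Score a pretraining example by how well its gradient aligns with the
    # gradient that would reduce the model's loss on the ICL-formatted batch.
    params = [p for p in model.parameters() if p.requires_grad]
    x_pt, y_pt = pretrain_example
    g_pt = flat_grad(loss_fn(model(x_pt), y_pt), params)
    x_icl, y_icl = icl_batch
    g_icl = flat_grad(loss_fn(model(x_icl), y_icl), params)
    return F.cosine_similarity(g_pt, g_icl, dim=0).item()

# Toy usage: a tiny classifier and synthetic data stand in for a language
# model, its pretraining corpus, and an ICL evaluation batch.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()
icl_batch = (torch.randn(8, 16), torch.randint(0, 4, (8,)))
candidates = [(torch.randn(2, 16), torch.randint(0, 4, (2,))) for _ in range(100)]

scores = [supportiveness(model, ex, icl_batch, loss_fn) for ex in candidates]
top_k = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:10]
print("selected candidate indices:", top_k)
# An iterative version would continue pretraining on the selected subset,
# then re-score the remaining candidates with the updated model and repeat.
```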