Understanding In-Context Learning via Supportive Pretraining Data

Abstract

In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time. It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations. Unlike prior work that explores implicit mechanisms behind ICL, we study ICL by investigating the pretraining data. Specifically, we first adapt an iterative, gradient-based approach to find a small subset of pretraining data that supports ICL. We observe that continued pretraining on this small subset significantly improves the model's ICL ability, by up to 18%. We then compare the supportive subset contrastively with random subsets of pretraining data and discover: (1) The pretraining data that support ICL do not have a higher domain relevance to downstream tasks. (2) The supportive pretraining data have a higher mass of rarely occurring, long-tail tokens. (3) The supportive pretraining data are challenging examples where the information gain from long-range context is below average, indicating that learning to incorporate difficult long-range context encourages ICL. Our work takes a first step towards understanding ICL by analyzing instance-level pretraining data. Our insights have the potential to enhance the ICL ability of language models by actively guiding the construction of pretraining data in the future.
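
As an illustration of finding (2), the following is a minimal sketch of how the "mass of rarely occurring, long-tail tokens" in a single pretraining instance could be measured. It is not the authors' procedure: the abstract does not specify the tokenizer, the frequency threshold, or the exact statistic, so the function names (`corpus_token_counts`, `long_tail_mass`), the `max_count` cutoff, and the use of pre-split word tokens are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Sequence


def corpus_token_counts(corpus: Iterable[Sequence[str]]) -> Counter:
    """Count how often each token occurs across the whole corpus."""
    counts: Counter = Counter()
    for instance in corpus:
        counts.update(instance)
    return counts


def long_tail_mass(instance: Sequence[str], counts: Counter, max_count: int = 1) -> float:
    """Fraction of tokens in `instance` whose corpus count is <= max_count.

    `max_count` is a hypothetical cutoff for what counts as "long-tail";
    the source does not state one.
    """
    if not instance:
        return 0.0
    rare = sum(1 for tok in instance if counts[tok] <= max_count)
    return rare / len(instance)


# Toy corpus of pre-tokenized instances (assumed data, for illustration only).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "axolotl", "yawned"],
]
counts = corpus_token_counts(corpus)
masses = [long_tail_mass(inst, counts) for inst in corpus]
```

Under this toy definition, an instance scores higher when more of its tokens are rare in the corpus, so a supportive subset could be compared against random subsets by averaging this score over each.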

