Understanding In-Context Learning via Supportive Pretraining Data

Abstract

In-context learning (ICL) improves language models' performance on a variety of NLP tasks by simply demonstrating a handful of examples at inference time. It is not well understood why ICL ability emerges, as the model has never been specifically trained on such demonstrations. Unlike prior work that explores implicit mechanisms behind ICL, we study ICL by investigating the pretraining data. Specifically, we first adapt an iterative, gradient-based approach to find a small subset of pretraining data that supports ICL. We observe that continued pretraining on this small subset significantly improves the model's ICL ability, by up to 18%. We then compare the supportive subset contrastively with random subsets of pretraining data and discover: (1) The pretraining data that support ICL do not have a higher domain relevance to downstream tasks. (2) The supportive pretraining data have a higher mass of rarely occurring, long-tail tokens. (3) The supportive pretraining data are challenging examples where the information gain from long-range context is below average, indicating that learning to incorporate difficult long-range context encourages ICL. Our work takes a first step towards understanding ICL by analyzing instance-level pretraining data. Our insights have the potential to enhance the ICL ability of language models by actively guiding the construction of pretraining data in the future.
