Understanding the Impact of Data Temporality on Large Language Model Pre-training

Abstract

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at training time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pre-train 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, while shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, together with the release of our code (https://github.com/kyutai-labs/kairos) and of our checkpoints and datasets (https://huggingface.co/collections/kyutai/kairos), provide a foundation for future research on continual learning for LLMs.
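
To make the two data orderings compared above concrete, the following is a minimal sketch and not the released implementation. The toy corpus, the snapshot labels, and the function names temporal_order and shuffled_order are hypothetical, and shuffling documents within each snapshot is an assumption the abstract does not state.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy corpus: snapshot label -> documents.
# Labels and texts are illustrative only, not the paper's data.
SNAPSHOTS: Dict[str, List[str]] = {
    "2021-04": ["doc a", "doc b"],
    "2019-09": ["doc c"],
    "2023-06": ["doc d", "doc e"],
}


def temporal_order(snapshots: Dict[str, List[str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Visit snapshots oldest first; shuffle only inside each snapshot (assumption)."""
    rng = random.Random(seed)
    stream: List[Tuple[str, str]] = []
    for label in sorted(snapshots):  # zero-padded "YYYY-MM" labels sort chronologically
        docs = [(label, doc) for doc in snapshots[label]]
        rng.shuffle(docs)
        stream.extend(docs)
    return stream


def shuffled_order(snapshots: Dict[str, List[str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Baseline: pool every document and shuffle globally, ignoring time."""
    rng = random.Random(seed)
    stream = [(label, doc) for label, docs in snapshots.items() for doc in docs]
    rng.shuffle(stream)
    return stream


if __name__ == "__main__":
    print([label for label, _ in temporal_order(SNAPSHOTS)])
    print([label for label, _ in shuffled_order(SNAPSHOTS)])
```

Both functions emit the same set of documents; only the order in which a model would see them differs, which is the variable the comparison isolates.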