Understanding the Impact of Data Temporality on Large Language Model Pre-training

Abstract

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at training time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our main contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions, together with an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pre-train 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against models pre-trained with standard shuffling. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training improves factual freshness, whereas shuffled pre-training performs best on older data, possibly because older facts are repeated more often in the corpus. These findings, along with the release of our code at https://github.com/kyutai-labs/kairos and of our checkpoints and datasets at https://huggingface.co/collections/kyutai/kairos, provide a foundation for future research on continual learning for LLMs.
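To make the contrast between the two data orderings concrete, the following minimal sketch builds a temporally ordered stream and a fully shuffled stream from the same documents. It is an illustration under stated assumptions, not the released pipeline: the `Doc` type, the `snapshot` label format (sortable strings such as "2019-01"), and the choice to shuffle documents within each snapshot are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical snapshot label; assumed to sort chronologically as a string.
    snapshot: str
    text: str


def temporally_ordered(docs, seed=0):
    """Return documents oldest snapshot first.

    Documents inside one snapshot are shuffled (an assumption of this sketch),
    so only the order between snapshots is chronological.
    """
    rng = random.Random(seed)
    by_snapshot = {}
    for doc in docs:
        by_snapshot.setdefault(doc.snapshot, []).append(doc)
    ordered = []
    for snapshot in sorted(by_snapshot):
        group = list(by_snapshot[snapshot])
        rng.shuffle(group)
        ordered.extend(group)
    return ordered


def fully_shuffled(docs, seed=0):
    """Return the same documents in a random order that ignores snapshots."""
    rng = random.Random(seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    return shuffled
```

Both functions return the same multiset of documents; only the order in which a training loop would consume them differs, which is the variable the study isolates.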