Understanding the Impact of Data Temporality in Large Language Model Pre-training

Abstract

Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at training time and whose temporal grounding remains poorly understood. In this work, we study how pre-training dynamics affect the acquisition of time-sensitive factual knowledge, focusing specifically on data ordering. Our contributions are twofold. First, we introduce a comprehensive benchmark of over 7,000 temporally grounded questions and an evaluation protocol that enables analysis of whether models correctly associate facts with their corresponding time periods. Second, we pre-train 6B-parameter models on temporally ordered Common Crawl snapshots and compare them against standard shuffled pre-training. Our results show that sequentially trained models match shuffled baselines on general language understanding and common knowledge while consistently exhibiting more up-to-date and temporally precise knowledge. Temporally ordered pre-training yields improved factual freshness, whereas shuffled pre-training peaks on older data, possibly due to increased factual repetition. These findings, along with the release of our code (https://github.com/kyutai-labs/kairos), checkpoints, and datasets (https://huggingface.co/collections/kyutai/kairos), provide a foundation for future research on continual learning for LLMs.