Effective Long-Context Scaling of Foundation Models

Abstract

We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series is built through continual pretraining from Llama 2 with longer training sequences and on a dataset in which long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements over Llama 2 on most regular tasks and significant improvements on long-context tasks. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant already surpasses gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis of the individual components of our method. We delve into Llama's position encodings and discuss their limitations in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths. Our ablation experiments suggest that having abundant long texts in the pretraining dataset is not the key to achieving strong performance, and we empirically verify that long-context continual pretraining is more efficient than, and similarly effective to, pretraining from scratch with long sequences.
