A Controlled Study on Long Context Extension and Generalization in LLMs

Abstract

Broad textual understanding and in-context learning require language models that utilize full document contexts. Because directly training long-context models is difficult to implement, many methods have been proposed for extending existing models to handle long contexts. However, owing to differences in data and model classes, these approaches have been hard to compare, leaving it uncertain how long-context performance should be evaluated and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, using consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator, even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning-based methods are generally effective within the range of their extension, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.
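
For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The following is a minimal sketch of that standard definition, not the evaluation code used in the study; the function name `perplexity` and the toy input are hypothetical and chosen only for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return perplexity given per-token natural-log probabilities.

    Hypothetical helper for illustration: `token_log_probs` is assumed to
    hold log p(token_i | preceding tokens) for each token in a sequence.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token.
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    # Perplexity is exp of the mean negative log-likelihood.
    return math.exp(mean_nll)


# Toy input (an assumption, not data from the paper): a model that assigns
# probability 0.25 to every token in an 8-token sequence.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

Under this definition, a model that spreads its probability evenly over four candidates at every position has a perplexity of about 4, so lower values indicate sharper and more accurate next-token predictions.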
