Predicting Task Performance with Context-aware Scaling Laws

Abstract

Scaling laws have transformed our understanding of large language models by linking upstream metrics such as cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate the framework by fitting it to the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that the framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer insight into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.
