Quantifying Generalization Complexity for Large Language Models

Abstract

While large language models (LLMs) have shown exceptional capabilities in understanding complex queries and performing sophisticated tasks, their generalization abilities are often deeply entangled with memorization, necessitating more precise evaluation. To address this challenge, we introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of LLMs. Scylla disentangles generalization from memorization by assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data across 20 tasks at 5 levels of complexity. Through extensive experiments, we uncover a non-monotonic relationship between task complexity and the performance gap between ID and OOD data, which we term the generalization valley. Specifically, this phenomenon reveals a critical threshold, referred to as critical complexity, at which reliance on non-generalizable behavior peaks, indicating the upper bound of LLMs' generalization capabilities. As model size increases, the critical complexity shifts toward higher levels of task complexity, suggesting that larger models can handle more complex reasoning tasks before over-relying on memorization. Leveraging Scylla and the concept of critical complexity, we benchmark 28 LLMs, including open-source models such as the LLaMA and Qwen families and closed-source models such as Claude and GPT, providing a more robust evaluation and establishing a clearer understanding of LLMs' generalization capabilities.
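To make the notion of critical complexity concrete, the following is a minimal sketch, not the Scylla implementation. It assumes the performance gap at each complexity level is simply ID accuracy minus OOD accuracy, and that critical complexity is the level at which this gap is largest; the abstract does not state the exact formula. The function names and the accuracy values are hypothetical and purely illustrative.

```python
from typing import List, Sequence


def generalization_gap(id_acc: Sequence[float], ood_acc: Sequence[float]) -> List[float]:
    """Per-level gap, assumed here to be ID accuracy minus OOD accuracy."""
    if len(id_acc) != len(ood_acc):
        raise ValueError("ID and OOD accuracy lists must have the same length")
    return [i - o for i, o in zip(id_acc, ood_acc)]


def critical_complexity(id_acc: Sequence[float], ood_acc: Sequence[float]) -> int:
    """Return the 1-based complexity level at which the ID-OOD gap peaks."""
    gaps = generalization_gap(id_acc, ood_acc)
    peak_index = max(range(len(gaps)), key=gaps.__getitem__)
    return peak_index + 1


# Made-up accuracies for five complexity levels (NOT results from the paper).
id_acc = [0.95, 0.90, 0.80, 0.55, 0.30]
ood_acc = [0.93, 0.82, 0.60, 0.45, 0.27]

gaps = generalization_gap(id_acc, ood_acc)
level = critical_complexity(id_acc, ood_acc)
print("gaps per level:", [round(g, 2) for g in gaps])
print("critical complexity level:", level)
```

With these illustrative numbers the gap rises and then falls as complexity increases, which is the non-monotonic shape the abstract describes; a larger model would be expected to show its peak at a higher level.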
