Quantifying Generalization Complexity for Large Language Models
October 2, 2024
Authors: Zhenting Qi, Hongyin Luo, Xuliang Huang, Zhuokai Zhao, Yibo Jiang, Xiangjun Fan, Himabindu Lakkaraju, James Glass
cs.AI
Abstract
While large language models (LLMs) have shown exceptional capabilities in
understanding complex queries and performing sophisticated tasks, their
generalization abilities are often deeply entangled with memorization,
necessitating more precise evaluation. To address this challenge, we introduce
Scylla, a dynamic evaluation framework that quantitatively measures the
generalization abilities of LLMs. Scylla disentangles generalization from
memorization by assessing model performance on both in-distribution (ID) and
out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity.
Through extensive experiments, we uncover a non-monotonic relationship between
task complexity and the performance gap between ID and OOD data, which we term
the generalization valley. Specifically, this phenomenon reveals a critical
threshold - referred to as critical complexity - where reliance on
non-generalizable behavior peaks, indicating the upper bound of LLMs'
generalization capabilities. As model size increases, the critical complexity
shifts toward higher levels of task complexity, suggesting that larger models
can handle more complex reasoning tasks before over-relying on memorization.
Leveraging Scylla and the concept of critical complexity, we benchmark 28 LLMs,
including open-source models such as the LLaMA and Qwen families and
closed-source models like Claude and GPT, providing a more robust evaluation
and establishing a clearer understanding of LLMs' generalization capabilities.
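
To make the abstract's central quantities concrete, the sketch below computes the ID-OOD performance gap at each complexity level and locates the critical complexity as the level where that gap peaks. This is a minimal illustration only, not the authors' implementation: the accuracy numbers, function names, and the choice of accuracy as the metric are all hypothetical assumptions, and Scylla's actual tasks and metrics are defined in the paper.

```python
# Minimal sketch (hypothetical values, not the authors' code):
# the "generalization valley" is the curve of ID-OOD accuracy gaps
# across task complexity levels, and the "critical complexity" is
# the level at which that gap peaks.

# Hypothetical per-level accuracies for one model, complexity levels 1..5.
id_accuracy = {1: 0.95, 2: 0.92, 3: 0.88, 4: 0.80, 5: 0.70}   # in-distribution
ood_accuracy = {1: 0.93, 2: 0.85, 3: 0.70, 4: 0.66, 5: 0.60}  # out-of-distribution

def generalization_gap(id_acc: dict, ood_acc: dict) -> dict:
    """ID-OOD accuracy gap per complexity level; a larger gap suggests
    heavier reliance on non-generalizable (memorized) behavior."""
    return {level: round(id_acc[level] - ood_acc[level], 2) for level in id_acc}

def critical_complexity(gaps: dict) -> int:
    """Complexity level where the gap peaks (the valley's deepest point)."""
    return max(gaps, key=gaps.get)

gaps = generalization_gap(id_accuracy, ood_accuracy)
print(gaps)                       # {1: 0.02, 2: 0.07, 3: 0.18, 4: 0.14, 5: 0.1}
print(critical_complexity(gaps))  # 3
```

Under these made-up numbers the gap rises and then falls with complexity, so the non-monotonic "valley" shape and its peak at level 3 mirror the phenomenon the abstract describes; the paper's finding is that this peak shifts to higher complexity levels as model size grows.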