
A Tale of Tails: Model Collapse as a Change of Scaling Laws

February 10, 2024
Authors: Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe
cs.AI

Abstract

As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or are they doomed to degenerate, up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with the number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.
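
To make the "loss of scaling" phenomenon concrete, here is a minimal illustrative sketch in LaTeX. It is not the paper's exact result: the symbols T_0, T_c, and alpha are assumed placeholders for a data scale, an effective tail-cutoff sample size induced by synthesized data, and a task-dependent exponent.

```latex
% Illustrative sketch only; the paper's precise exponents and functional
% forms may differ. T = amount of training data, \alpha > 0 an exponent,
% T_0 a reference scale, T_c a hypothetical effective cutoff caused by
% synthetic data truncating the distribution's tail.
\[
  \underbrace{L(T) \;\approx\; \Bigl(\tfrac{T_0}{T}\Bigr)^{\alpha}}_{\text{clean (human) data}}
  \qquad\longrightarrow\qquad
  \underbrace{L(T) \;\approx\; \Bigl(\tfrac{T_0}{\min(T,\,T_c)}\Bigr)^{\alpha}}_{\text{tail truncated at } T_c}
\]
% Once T exceeds T_c, the loss plateaus at (T_0/T_c)^{\alpha} instead of
% continuing to decay -- one concrete way scaling can be "lost" when
% synthesized data enters the training corpus.
```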