
A Tale of Tails: Model Collapse as a Change of Scaling Laws

February 10, 2024
作者: Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe
cs.AI

Abstract

As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models means that the ecosystem of online data and text will co-evolve to progressively contain increased amounts of synthesized data. In this paper we ask: How will the scaling laws change in the inevitable regime where synthetic data makes its way into the training corpus? Will future models still improve, or are they doomed to degenerate up to total (model) collapse? We develop a theoretical framework of model collapse through the lens of scaling laws. We discover a wide range of decay phenomena, analyzing loss of scaling, shifted scaling with the number of generations, the "un-learning" of skills, and grokking when mixing human and synthesized data. Our theory is validated by large-scale experiments with a transformer on an arithmetic task and text generation using the large language model Llama2.
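
To make the abstract's central claim concrete, the sketch below is a deliberately crude toy, not the paper's code, models, or data (the paper's experiments use a transformer on arithmetic and Llama2). Scaling laws of the usual Kaplan/Hoffmann style say test loss shrinks roughly as a power law in the amount of clean data, e.g. L(D) ≈ E + B / D^β. The toy treats the training distribution as a Zipf law over a discrete "skill" space, models a synthesized-data generator as one that simply never emits the rare tail, and fits a smoothed count model on T samples from each; all sizes, exponents, and names here (V, alpha, k, tail_truncated, excess_loss) are illustrative assumptions.

```python
# Minimal illustrative sketch (assumptions throughout, not the paper's setup):
# cutting off the tail of a power-law data distribution -- a stand-in for
# training on synthesized data from a previous-generation model -- turns a
# shrinking loss curve into one that plateaus ("loss of scaling").
import numpy as np

rng = np.random.default_rng(0)

V = 100_000                        # size of the discrete "skill"/token space (assumed)
alpha = 1.5                        # Zipf exponent of the human data distribution (assumed)
p = 1.0 / np.arange(1, V + 1) ** alpha
p /= p.sum()                       # true (human) distribution

def tail_truncated(dist, k):
    """Toy synthetic-data distribution: keep only the k most frequent items, drop the tail."""
    q = dist.copy()
    q[k:] = 0.0
    return q / q.sum()

def excess_loss(train_dist, T, n_eval=50_000, eps=1e-9):
    """Estimated KL gap of a smoothed count model fit on T samples from train_dist,
    measured on samples from the true human distribution p."""
    counts = np.bincount(rng.choice(V, size=T, p=train_dist), minlength=V)
    q_hat = (counts + eps) / (counts.sum() + eps * V)
    eval_ids = rng.choice(V, size=n_eval, p=p)
    return float(np.mean(np.log(p[eval_ids]) - np.log(q_hat[eval_ids])))

synthetic = tail_truncated(p, k=1_000)   # generator that never produces rare items
for T in (10**3, 10**4, 10**5, 10**6):
    print(f"T={T:>8,}  human data: {excess_loss(p, T):6.3f}   "
          f"tail-cut synthetic: {excess_loss(synthetic, T):6.3f}")
```

With these toy settings one would expect the human-data column to keep decreasing as T grows, roughly power-law fashion, while the tail-cut column levels off at a floor set by the missing tail mass, a qualitative analogue of the decay phenomena the abstract describes.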