ChatPaper.ai

Model Dementia: Generated Data Makes Models Forget

May 27, 2023
作者: Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson
cs.AI

Abstract

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We call this effect model dementia and show that it can occur in Variational Autoencoders (VAEs), Gaussian Mixture Models (GMMs) and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, data collected from genuine human interactions with systems will become increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.
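The feedback loop the abstract describes — train on data, sample from the trained model, train the next generation on those samples — can be illustrated in the simplest distributional setting. The toy sketch below (not the authors' code; all names are illustrative) fits a single Gaussian by maximum likelihood, resamples from the fit, and repeats. Because each finite-sample fit slightly underestimates the spread on average, the estimated standard deviation drifts toward zero over generations, which is the one-dimensional analogue of the distribution's tails disappearing.

```python
import random
import statistics

def fit_and_resample(samples):
    """One 'generation': fit a Gaussian by maximum likelihood,
    then draw the next generation's training set from the fit."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # MLE (population) std estimate
    return [random.gauss(mu, sigma) for _ in samples]

random.seed(0)
# Generation 0: "real" data from a standard normal distribution.
data = [random.gauss(0.0, 1.0) for _ in range(50)]

for generation in range(1000):
    data = fit_and_resample(data)

# After many generations the spread has collapsed far below the
# original std of 1.0: the tails of the distribution are gone.
print(statistics.pstdev(data))
```

With only 50 samples per generation the collapse is fast; larger sample sizes slow it down but do not prevent it, since the estimation noise compounds multiplicatively across generations.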