Model Dementia: Generated Data Makes Models Forget
May 27, 2023
Authors: Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson
cs.AI
Abstract
Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and that they will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that using model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We call this effect model dementia and show that it can occur in Variational Autoencoders (VAEs), Gaussian Mixture Models (GMMs) and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training on large-scale data scraped from the web. Indeed, data collected from genuine human interactions with systems will become increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.