Model Dementia: Generated Data Makes Models Forget

Abstract

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs contribute much of the language found online? We find that using model-generated content in training causes irreversible defects in the resulting models, in which the tails of the original content distribution disappear. We call this effect "model dementia" and show that it can occur in Variational Autoencoders (VAEs), Gaussian Mixture Models (GMMs) and LLMs. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training on large-scale data scraped from the web. Indeed, data collected about genuine human interactions with systems will become increasingly valuable as LLM-generated content spreads through data crawled from the Internet.
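The loss of distribution tails described above can be illustrated with a toy sketch. This is not the authors' experimental setup; it assumes the simplest possible case, a single one-dimensional Gaussian that is refitted at every generation using only samples drawn from the previous generation's fit. The function name `recursive_fit` and its parameters (`generations`, `sample_size`, `seed`) are hypothetical and chosen for illustration only.

```python
import random
import statistics


def recursive_fit(generations=200, sample_size=20, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Toy illustration only (an assumption, not the paper's method).
    Returns the fitted standard deviation per generation, starting
    with the true value 1.0 at generation 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the "real data" distribution
    history = [sigma]
    for _ in range(generations):
        # Generate training data from the current model only.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Fit the next-generation model to that generated data.
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = recursive_fit()
    for generation in (0, 50, 100, 200):
        print(generation, history[generation])
```

With small sample sizes, the fitted standard deviation in a run like this typically drifts downward over many generations, because each refit sees only a finite sample of the previous model and the estimation errors compound. That is one way to picture how rare, tail events can vanish. The exact trajectory depends on the seed and sample size.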
