On the "Induction Bias" in Sequence Models

Abstract

Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. A growing body of literature has demonstrated this limitation primarily through failures of out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than the amount required by RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, so that data from one sequence length improves performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when the training and evaluation distributions match.
