Uncovering mesa-optimization algorithms in Transformers

Abstract

Transformers have become the dominant model in deep learning, but the reason for their superior performance is poorly understood. Here, we hypothesize that the strong performance of Transformers stems from an architectural bias towards mesa-optimization, a learned process running within the forward pass of a model consisting of the following two steps: (i) the construction of an internal learning objective, and (ii) its corresponding solution found through optimization. To test this hypothesis, we reverse-engineer a series of autoregressive Transformers trained on simple sequence modeling tasks, uncovering underlying gradient-based mesa-optimization algorithms driving the generation of predictions. Moreover, we show that the learned forward-pass optimization algorithm can be immediately repurposed to solve supervised few-shot tasks, suggesting that mesa-optimization might underlie the in-context learning capabilities of large language models. Finally, we propose a novel self-attention layer, the mesa-layer, that explicitly and efficiently solves optimization problems specified in context. We find that this layer can lead to improved performance in synthetic and preliminary language modeling experiments, adding weight to our hypothesis that mesa-optimization is an important operation hidden within the weights of trained Transformers.
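
As a rough illustration of the two steps named in the abstract, the sketch below (i) builds a least-squares objective from input-target pairs given in context and (ii) solves it with plain gradient descent before predicting for a query. It is a toy under stated assumptions, not the algorithm recovered in the paper and not the mesa-layer: the function name `mesa_predict`, the linear model, the step size, the step count, and the toy data are all hypothetical choices made for this example.

```python
def mesa_predict(context, query, steps=200, lr=0.5):
    """Toy in-context learner: fit linear weights on the context, then predict.

    context: list of (x, y) pairs, x a list of floats, y a float.
    query:   list of floats with the same length as each x.
    All names and defaults here are illustrative assumptions.
    """
    dim = len(query)
    n = len(context)
    w = [0.0] * dim  # internal parameters, reset for every new context

    for _ in range(steps):
        # Step (i): internal objective L(w) = 1/(2n) * sum_i (w.x_i - y_i)^2.
        # Step (ii): one gradient-descent step on that objective.
        grad = [0.0] * dim
        for x, y in context:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                grad[j] += err * x[j] / n
        w = [wi - lr * gj for wi, gj in zip(w, grad)]

    # Prediction for the query uses the weights fitted on this context only.
    return sum(wi * qi for wi, qi in zip(w, query))


# Hypothetical context generated by y = 2*x0 - x1.
context = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
prediction = mesa_predict(context, [2.0, 3.0])
print(prediction)  # close to 1.0, since 2*2 - 3 = 1
```

The weights are fitted from scratch for each context, which is the sense in which the optimization happens at prediction time rather than during training.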
