From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Abstract

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far into the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and leave the model stuck with that choice. We relax this rigidity by introducing an autoregressive U-Net that learns to embed its own tokens as it trains. The network reads raw bytes, pools them into words, then into pairs of words, then into groups of up to 4 words, giving it a multi-scale view of the sequence. At deeper stages, the model must predict further into the future (the next few words rather than the next byte), so deeper stages focus on broader semantic patterns while earlier stages handle fine details. With careful tuning and controlled pretraining compute, shallow hierarchies match strong BPE baselines, and deeper hierarchies show a promising trend. Because tokenization now lives inside the model, the same system can handle character-level tasks and carry knowledge across low-resource languages.
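To make the multi-scale pooling concrete, the following is a minimal sketch of how pooling positions could be chosen at each stage. It is not the authors' implementation: the whitespace-based word rule and the names `word_ends`, `every_kth`, and `pooling_positions` are assumptions introduced here for illustration only.

```python
# Illustrative sketch only: the splitting rule (ASCII whitespace) and all
# function names are assumptions, not taken from the paper.

WHITESPACE = b" \t\n\r"


def word_ends(data: bytes) -> list[int]:
    """Return the index of the last byte of each whitespace-delimited word."""
    ends = []
    for i, b in enumerate(data):  # iterating over bytes yields integers
        if b in WHITESPACE:
            continue
        # A word ends where the next byte is whitespace or the input stops.
        if i + 1 == len(data) or data[i + 1] in WHITESPACE:
            ends.append(i)
    return ends


def every_kth(positions: list[int], k: int) -> list[int]:
    """Keep every k-th position, i.e. the end of each complete group of k words."""
    return positions[k - 1 :: k]


def pooling_positions(data: bytes, group_sizes=(1, 2, 4)) -> dict[int, list[int]]:
    """Map each group size to the byte positions where that stage would pool."""
    words = word_ends(data)
    return {k: every_kth(words, k) for k in group_sizes}


if __name__ == "__main__":
    text = b"from bytes to ideas and beyond"
    for k, positions in pooling_positions(text).items():
        print(k, positions)
```

In this sketch each deeper stage sees a shorter sequence: six positions at the word stage, three at the word-pair stage, and one at the four-word stage for the sample sentence. A trailing group with fewer than `k` words is left unpooled here, which is a simplification.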
