Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining

Abstract

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then deteriorating continuously. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories lowers the minimum validation loss further. Our experiments demonstrate that data augmentations mitigate the data inefficiency of AR pretraining and offer a promising solution for the data-constrained regime.\footnote{All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.}
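
To make the three augmentation categories concrete, the following is a minimal Python sketch operating on lists of token ids. It is an illustration under stated assumptions, not the authors' implementation (see their repository for that): the function names, the noise probability `p`, the sentinel ids `pre_id`, `suf_id`, `mid_id`, `mask_id`, and the prefix-suffix-middle layout chosen for Fill-in-the-Middle are all hypothetical choices made here for illustration.

```python
import random


def mask_tokens(tokens, mask_id, p, rng):
    """Token-level noise: replace each token with a mask id with probability p."""
    return [mask_id if rng.random() < p else t for t in tokens]


def random_replace(tokens, vocab_size, p, rng):
    """Token-level noise: replace each token with a uniformly random id with probability p."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]


def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so the model predicts right to left."""
    return tokens[::-1]


def fill_in_the_middle(tokens, rng, pre_id, suf_id, mid_id):
    """Sequence permutation: split into prefix/middle/suffix and move the middle to the end.

    The sentinel ids and the prefix-suffix-middle ordering are assumptions of this sketch.
    """
    i, j = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle


def offset_pairs(tokens, i):
    """Target offset prediction: position t is trained to predict x_{t+i}.

    i = 1 is ordinary next-token prediction; i > 1 gives the augmented objective.
    """
    if i < 1 or i >= len(tokens):
        raise ValueError("offset must satisfy 1 <= i < len(tokens)")
    return tokens[:-i], tokens[i:]


if __name__ == "__main__":
    rng = random.Random(0)
    seq = [10, 11, 12, 13, 14, 15]
    print(mask_tokens(seq, mask_id=0, p=0.3, rng=rng))
    print(random_replace(seq, vocab_size=100, p=0.3, rng=rng))
    print(right_to_left(seq))
    print(fill_in_the_middle(seq, rng, pre_id=1, suf_id=2, mid_id=3))
    print(offset_pairs(seq, 2))
```

In a multi-epoch setting, such transforms would typically be resampled each time a sequence is drawn, so that the model sees a different view of the same data in every epoch; whether and how categories are mixed within a batch is a design choice this sketch leaves open.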