Characterizing Training-Time Data Augmentation for Data-Constrained Language Model Pretraining

Abstract

As AI labs approach a data ceiling where compute capacity outpaces the rate at which new high-quality text is produced, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then deteriorating steadily. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (predicting $x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the lowest minimum loss among individual methods. Combining augmentation categories lowers the minimum validation loss further. Our experiments demonstrate that data augmentation mitigates the data inefficiency of AR pretraining and offers a promising solution for the data-constrained regime.\footnote{All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.}
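To make the three augmentation categories concrete, the following is a minimal sketch on lists of integer token ids. It is not the authors' implementation: the function names, the noise probability `p`, the reserved `MASK_ID`, the sentinel ids passed to `fill_in_the_middle`, and the choice to corrupt only the inputs while the caller keeps clean targets are all assumptions made for illustration.

```python
import random

MASK_ID = 0  # hypothetical id reserved for a mask token (assumption)


def token_noise(tokens, vocab_size, p=0.15, mode="replace", rng=random):
    """Token-level noise: corrupt each token independently with probability p.

    mode="mask" swaps the token for MASK_ID; mode="replace" swaps it for a
    uniformly sampled id in [1, vocab_size). The caller is assumed to keep
    the clean sequence as the prediction target.
    """
    out = []
    for tok in tokens:
        if rng.random() < p:
            out.append(MASK_ID if mode == "mask" else rng.randrange(1, vocab_size))
        else:
            out.append(tok)
    return out


def right_to_left(tokens):
    """Sequence permutation: reverse the order so next-token prediction
    on the result corresponds to right-to-left prediction on the original."""
    return tokens[::-1]


def fill_in_the_middle(tokens, pre_id, suf_id, mid_id, rng=random):
    """Sequence permutation: split into prefix/middle/suffix at two random
    cut points and emit prefix, suffix, then middle, separated by
    hypothetical sentinel ids."""
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id, *prefix, suf_id, *suffix, mid_id, *middle]


def offset_pairs(tokens, offset=1):
    """Target offset prediction: pair the token at position t with the
    target x_{t+offset}. offset=1 recovers standard AR training pairs."""
    return list(zip(tokens[:-offset], tokens[offset:]))
```

In a training loop, one of these transforms would be sampled per sequence (or per batch) each epoch, so that repeated passes over the same corpus present the model with different views of the data. How the transforms are scheduled and mixed is left open here.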