NITP: Next Implicit Token Prediction for LLM Pre-training

Abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate, anisotropic configurations that can limit generalization. To address this, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense, continuous supervision applied directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide a theoretical analysis showing that NITP regularizes the optimization landscape by reducing under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and mixture-of-experts (MoE) models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, at approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
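To make the idea of adding continuous supervision to the discrete NTP loss concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the form of the representation loss, its weight, or how the prediction is produced, so everything here is an assumption for illustration: the cosine-distance loss, the weight `aux_weight`, the hypothetical per-position prediction vectors `predicted_reps`, and the choice to treat the shallow-layer representations as fixed targets. It uses plain Python lists so that it runs without any deep-learning library.

```python
import math


def cross_entropy(logits, target_index):
    """Discrete NTP loss for one position: -log softmax(logits)[target_index]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def cosine_distance(u, v):
    """1 - cos(u, v); assumed form of the continuous loss (not from the paper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def combined_loss(logits, next_token_ids, predicted_reps, shallow_reps, aux_weight=0.1):
    """Return (total, ntp, aux) averaged over the n predicting positions.

    logits[t]         : output logits at position t (t = 0..n-1)
    next_token_ids[t] : id of token t+1, the usual one-hot label
    predicted_reps[t] : vector predicted at position t for the next token
                        (output of a hypothetical prediction head)
    shallow_reps[t]   : shallow-layer representation of token t (t = 0..n),
                        treated here as a constant target
    aux_weight        : hypothetical weight on the representation loss
    """
    n = len(logits)
    ntp = sum(cross_entropy(logits[t], next_token_ids[t]) for t in range(n)) / n
    # The target for position t is the shallow representation of token t+1.
    aux = sum(cosine_distance(predicted_reps[t], shallow_reps[t + 1]) for t in range(n)) / n
    return ntp + aux_weight * aux, ntp, aux


if __name__ == "__main__":
    total, ntp, aux = combined_loss(
        logits=[[0.0, 0.0]],
        next_token_ids=[1],
        predicted_reps=[[1.0, 0.0]],
        shallow_reps=[[0.0, 1.0], [2.0, 0.0]],
    )
    print(total, ntp, aux)
```

In a real training loop both terms would be computed on tensors with automatic differentiation, and the shallow-layer targets would presumably be excluded from the gradient so that they act as targets rather than being pulled toward the predictions; that detail is also an assumption here. Because the auxiliary term is only needed during training, a design of this kind is consistent with the abstract's claim of no additional inference cost.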