NITP: Next Implicit Token Prediction for LLM Pre-training

Abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide a theoretical analysis showing that NITP regularizes the optimization landscape by reducing under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
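
As a rough illustration of the idea, the sketch below adds a representation-space term to the usual cross-entropy loss: the final-layer state at position t is compared with the shallow-layer representation of the token at position t+1. The cosine distance, the weight `lam`, all function names, and the exact position alignment are assumptions made for illustration. The abstract does not specify them, and the released implementation may differ.

```python
import math

def cross_entropy(logits, target):
    """Standard NTP loss for one position: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def cosine_distance(u, v):
    """1 - cosine similarity; an assumed choice of continuous loss."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-8)

def nitp_style_loss(logits, labels, deep_states, shallow_states, lam=0.1):
    """Hypothetical combined objective for a sequence of T positions.

    logits[t], labels[t]  : logits at position t and the id of token t+1
                            (T-1 entries each)
    deep_states[t]        : final-layer hidden state at position t (T entries)
    shallow_states[t]     : shallow-layer hidden state at position t (T entries)
    lam                   : assumed weight of the representation term
    """
    n = len(labels)
    ntp = sum(cross_entropy(logits[t], labels[t]) for t in range(n)) / n

    # Dense continuous term: the deep state at t should match the shallow
    # representation of the NEXT position. In an autograd framework the
    # target would presumably be treated as a constant (assumption).
    steps = len(deep_states) - 1
    rep = sum(
        cosine_distance(deep_states[t], shallow_states[t + 1])
        for t in range(steps)
    ) / steps

    return ntp + lam * rep, ntp, rep
```

When every deep state already points in the same direction as the next position's shallow representation, the added term is close to zero and the total reduces to the plain NTP loss. Because the extra term only affects training, a setup of this kind adds no cost at inference time, in line with the claim in the abstract.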