NITP: Next Implicit Token Prediction for Large Language Model Pre-training

Abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate, anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense, continuous supervision applied directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide a theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and mixture-of-experts (MoE) models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, at a cost of approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
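To make the idea of combining discrete and continuous supervision concrete, the following is a minimal pure-Python sketch of such a combined objective. It is not the authors' implementation. The abstract does not specify the distance function, the loss weight, which shallow layer supplies the targets, or how the latent prediction is produced, so the cosine distance, the `aux_weight` value, and all function and parameter names below are assumptions chosen for illustration only.

```python
import math


def cross_entropy(logits, target_index):
    """Standard NTP loss at one position: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def cosine_distance(u, v):
    """1 - cos(u, v); an assumed choice of distance, not taken from the paper."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nitp_style_loss(logits, next_token_ids, predicted_latents,
                    shallow_states, aux_weight=0.1):
    """Discrete NTP loss plus a dense latent-prediction term (hypothetical form).

    logits[t]            vocabulary scores emitted at position t
    next_token_ids[t]    index of the token that follows position t
    predicted_latents[t] the model's continuous prediction made at position t
    shallow_states[t]    shallow-layer representation of the token at position t,
                         used here as a fixed target (in an autodiff framework one
                         would presumably block gradients through it; assumption)
    Requires at least two positions.
    """
    n = len(logits)
    ntp = sum(cross_entropy(logits[t], next_token_ids[t]) for t in range(n)) / n

    # The prediction at position t is compared with the shallow representation
    # of position t + 1; the final position has no in-sequence target.
    pairs = [(predicted_latents[t], shallow_states[t + 1]) for t in range(n - 1)]
    latent = sum(cosine_distance(p, s) for p, s in pairs) / len(pairs)

    return ntp + aux_weight * latent, ntp, latent


# Toy usage: two positions, vocabulary of size 2, 2-dimensional latents.
total, ntp, latent = nitp_style_loss(
    logits=[[0.0, 0.0], [0.0, 0.0]],
    next_token_ids=[0, 1],
    predicted_latents=[[1.0, 0.0], [0.0, 1.0]],
    shallow_states=[[0.0, 1.0], [1.0, 0.0]],
)
```

In this toy call the latent prediction at position 0 matches the shallow state of position 1 exactly, so the auxiliary term vanishes and only the discrete loss remains. Because the extra term is used only during training, a sketch of this shape is consistent with the abstract's claim of no additional inference cost.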