NITP: Next Implicit Token Prediction for Large Language Model Pretraining

Abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
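To make the training objective described above concrete, the following is a minimal NumPy sketch of an NTP loss augmented with a representation-space term. The abstract does not specify the form of the latent loss, which layers are used, or how the two terms are weighted, so everything here is an assumption for illustration: the cosine-distance loss, the hypothetical function names `ntp_loss`, `latent_loss`, and `combined_loss`, and the hypothetical weight `aux_weight`. In an actual training setup the shallow-layer target would presumably be treated as a constant (no gradient), which plain NumPy does not model.

```python
import numpy as np


def ntp_loss(logits, targets):
    """Mean cross-entropy of next-token logits (T, V) against integer labels (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def latent_loss(deep_hidden, shallow_hidden, eps=1e-8):
    """Mean cosine distance between the deep state at position t and the
    shallow-layer state of the token at position t + 1 (assumed loss form)."""
    pred = deep_hidden[:-1]      # deep states that "predict" the next token
    target = shallow_hidden[1:]  # shallow representation of that next token
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    cos = (pred * target).sum(axis=-1) / (norms + eps)
    return (1.0 - cos).mean()


def combined_loss(logits, targets, deep_hidden, shallow_hidden, aux_weight=0.1):
    """Discrete NTP loss plus a weighted continuous term (aux_weight is hypothetical)."""
    return ntp_loss(logits, targets) + aux_weight * latent_loss(deep_hidden, shallow_hidden)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, V, d = 6, 11, 4                       # toy sequence length, vocab size, width
    logits = rng.normal(size=(T, V))
    targets = rng.integers(0, V, size=T)     # next-token ids for each position
    deep = rng.normal(size=(T, d))           # stand-in for deep-layer hidden states
    shallow = rng.normal(size=(T, d))        # stand-in for shallow-layer hidden states
    print(combined_loss(logits, targets, deep, shallow))
```

Because the auxiliary term in this sketch only reads hidden states that the forward pass already computes, it adds work during training and none at inference, which is in line with the overhead profile the abstract reports.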