NITP: Next Implicit Token Prediction for LLM Pre-training

Abstract

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate, anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision applied directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide a theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, at the cost of approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
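To make the idea of combining discrete and continuous supervision concrete, the following is a minimal NumPy sketch of one way such an objective could be assembled. It is not taken from the released implementation. The cosine-distance form of the auxiliary term, the linear projection `proj`, the weight `aux_weight`, and all function names are assumptions introduced here for illustration only; the abstract states only that shallow-layer representations of the next token serve as targets.

```python
import numpy as np


def ntp_loss(logits, targets):
    """Mean cross-entropy of next-token logits (T, V) against token ids (T,)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def implicit_token_loss(deep_hidden, shallow_hidden, proj, eps=1e-8):
    """Hypothetical auxiliary term (assumed form, not the paper's definition):
    1 - cosine similarity between the projected deep state at position t and
    the shallow-layer state at position t + 1."""
    pred = deep_hidden[:-1] @ proj   # (T-1, d): predictions for positions 1..T-1
    target = shallow_hidden[1:]      # (T-1, d): shallow states of the next tokens
    # In an autodiff framework the target would likely be held constant
    # (stop-gradient); that is an assumption, NumPy has no gradients to stop.
    pred = pred / np.maximum(np.linalg.norm(pred, axis=-1, keepdims=True), eps)
    target = target / np.maximum(np.linalg.norm(target, axis=-1, keepdims=True), eps)
    return (1.0 - (pred * target).sum(axis=-1)).mean()


def combined_loss(logits, targets, deep_hidden, shallow_hidden, proj, aux_weight=0.1):
    """Discrete NTP loss plus a weighted continuous representation loss.
    aux_weight is an illustrative value, not one reported in the paper."""
    aux = implicit_token_loss(deep_hidden, shallow_hidden, proj)
    return ntp_loss(logits, targets) + aux_weight * aux


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, V, d = 6, 10, 4  # sequence length, vocabulary size, hidden width
    loss = combined_loss(
        logits=rng.normal(size=(T, V)),
        targets=rng.integers(0, V, size=T),
        deep_hidden=rng.normal(size=(T, d)),
        shallow_hidden=rng.normal(size=(T, d)),
        proj=np.eye(d),
    )
    print(float(loss))
```

Because the auxiliary term in this sketch only reads hidden states that a forward pass already produces, it adds no work at inference time, which is in line with the abstract's claim of no additional inference cost.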