

Diversity or Precision? A Deep Dive into Next Token Prediction

December 28, 2025
作者: Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu
cs.AI

Abstract

Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
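The abstract does not spell out the exact training objective, but the core idea of reweighting next-token log-probabilities with shaped per-token rewards can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `reward_shaped_ce` and the parameters `alpha` (positive reward scale on the ground-truth token), `k` (rank cutoff between high- and low-ranking negatives), `beta_high`, and `beta_low` (rewards assigned to the two groups of negative tokens) are all assumed names. With `alpha=1` and both betas set to zero, the objective reduces to standard cross-entropy, consistent with the abstract's reading of cross-entropy as a single-step special case of policy gradient.

```python
import torch
import torch.nn.functional as F

def reward_shaped_ce(logits, targets, alpha=1.0, k=20,
                     beta_high=0.0, beta_low=0.0):
    """Reward-weighted next-token loss (illustrative sketch only).

    logits    : (batch, vocab) next-token logits from the language model
    targets   : (batch,) ground-truth token ids
    alpha     : positive reward scale on the ground-truth token
    k         : rank cutoff between "high-ranking" and "low-ranking" negatives
    beta_high : reward assigned to the top-k non-target tokens
    beta_low  : reward assigned to the remaining negative tokens
    """
    log_probs = F.log_softmax(logits, dim=-1)

    # Rank every vocabulary entry by the model's own logits
    # (rank 0 = most probable candidate).
    ranks = logits.argsort(dim=-1, descending=True).argsort(dim=-1)

    # Shaped per-token reward: alpha on the ground truth, beta_high on
    # high-ranking negatives, beta_low on low-ranking negatives.
    reward = torch.where(ranks < k,
                         torch.full_like(log_probs, beta_high),
                         torch.full_like(log_probs, beta_low))
    reward.scatter_(1, targets.unsqueeze(1), alpha)

    # Single-step policy-gradient surrogate: - sum_y r(y) * log pi(y | x).
    # With alpha=1 and beta_high=beta_low=0 this is exactly the standard
    # cross-entropy loss on the ground-truth token.
    return -(reward * log_probs).sum(dim=-1).mean()


# Minimal usage example with random data (vocab of 1000, batch of 4).
if __name__ == "__main__":
    logits = torch.randn(4, 1000)
    targets = torch.randint(0, 1000, (4,))
    print(reward_shaped_ce(logits, targets, alpha=2.0, k=50, beta_high=-0.1))
```

In this reading, pushing `alpha` above 1 concentrates probability mass on the ground-truth token (the precision-oriented prior the abstract argues for), while the sign and magnitude of `beta_high` versus `beta_low` control how asymmetrically high- and low-ranking negatives are treated.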