

Diversity or Precision? A Deep Dive into Next Token Prediction

December 28, 2025
Authors: Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu
cs.AI

Abstract

Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
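To make the idea concrete, the sketch below illustrates one way a shaped next-token objective of this kind could be written in PyTorch: a scaled log-likelihood term on the ground-truth token (recovering standard cross-entropy as a special case) combined with an asymmetric, rank-aware penalty on negative tokens. The function name and the hyperparameters `pos_scale`, `high_rank_weight`, `low_rank_weight`, and `top_k`, as well as the exact form of the negative term, are illustrative assumptions and not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def shaped_next_token_loss(logits, targets, pos_scale=1.0,
                           high_rank_weight=1.0, low_rank_weight=0.1,
                           top_k=20):
    """Hypothetical sketch of a reward-shaped next-token objective.

    logits:  (batch, vocab) pre-softmax scores
    targets: (batch,) ground-truth token ids
    pos_scale:        scaling on the positive (ground-truth) term
    high_rank_weight: penalty weight for high-ranking negative tokens
    low_rank_weight:  penalty weight for low-ranking negative tokens
    top_k:            negatives ranked within the top_k count as "high-ranking"
    """
    log_probs = F.log_softmax(logits, dim=-1)  # log pi(a | s) for the single-step episode
    probs = log_probs.exp()

    # Positive term: scaled log-likelihood of the ground-truth token.
    # With pos_scale == 1 and the negative term switched off, this is
    # ordinary cross-entropy.
    pos_term = pos_scale * log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)

    # Rank every token by model probability (0 = most probable) and weight
    # high- vs low-ranking negatives asymmetrically.
    ranks = probs.argsort(dim=-1, descending=True).argsort(dim=-1)
    neg_weights = torch.where(ranks < top_k,
                              torch.full_like(probs, high_rank_weight),
                              torch.full_like(probs, low_rank_weight))
    neg_weights.scatter_(1, targets.unsqueeze(1), 0.0)  # exclude the ground-truth token

    # Negative term: penalize probability mass placed on negative tokens.
    neg_term = (neg_weights * probs).sum(dim=-1)

    # Maximize the positive term and minimize the negative term.
    return (-pos_term + neg_term).mean()
```

Under this reading, raising `pos_scale` or `high_rank_weight` pushes the pre-trained distribution toward a precision-oriented prior, while lowering them leaves more probability mass on alternative tokens and hence higher output entropy; the abstract's finding is that the precision-oriented setting gives subsequent RL the better exploration space.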