

BOW: Bottlenecked Next Word Exploration

June 16, 2025
作者: Ming Shen, Zhikun Xu, Xiao Ye, Jacob Dineen, Ben Zhou
cs.AI

Abstract

Large language models (LLMs) are typically trained via next-word prediction (NWP), which provides strong surface-level fluency but often lacks support for robust reasoning. We propose BOttlenecked next Word exploration (BOW), a novel RL framework that rethinks NWP by introducing a reasoning bottleneck: a policy model first generates a reasoning path rather than predicting the next token directly, after which a frozen judge model predicts the next-token distribution based solely on this reasoning path. We train the policy model using GRPO with rewards that quantify how effectively the reasoning path facilitates next-word recovery. We show that, compared with other continual pretraining baselines, BOW improves both the general and next-word reasoning capabilities of the base model, evaluated on various benchmarks. Our findings show that BOW can serve as an effective and scalable alternative to vanilla NWP.
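The reward described in the abstract can be sketched roughly as follows: the policy emits a reasoning path, a frozen judge predicts a next-token distribution from that path alone, and the reward measures how much probability the judge recovers for the gold next word. This is a minimal illustrative sketch, not the paper's implementation; `judge_next_token_probs` is a hypothetical stand-in for a frozen LLM call, and the toy vocabulary is invented for the example.

```python
import math

def judge_next_token_probs(reasoning_path: str) -> dict[str, float]:
    # Hypothetical frozen judge: in the paper this is an LLM conditioned
    # only on the reasoning path (not the original prefix). Here we
    # return a fixed toy distribution for illustration.
    return {"cat": 0.7, "dog": 0.2, "car": 0.1}

def bow_reward(reasoning_path: str, gold_next_token: str) -> float:
    # Reward = probability the frozen judge assigns to the gold next
    # token after reading only the reasoning path. Tokens the judge
    # never proposes earn zero reward.
    probs = judge_next_token_probs(reasoning_path)
    return probs.get(gold_next_token, 0.0)

# A high reward means the reasoning path made the next word recoverable.
reward = bow_reward("A furry animal is purring on the mat, so the next word is likely ...", "cat")
```

In a GRPO setup, several reasoning paths would be sampled per prefix and their rewards compared within the group to form the policy-gradient advantage; the judge's parameters stay frozen throughout.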