BOW: Bottlenecked Next Word Exploration

Abstract

Large language models (LLMs) are typically trained via next-word prediction (NWP), which yields strong surface-level fluency but often fails to support robust reasoning. We propose BOttlenecked next Word exploration (BOW), a novel reinforcement learning (RL) framework that rethinks NWP by introducing a reasoning bottleneck: a policy model first generates a reasoning path rather than predicting the next token directly, after which a frozen judge model predicts the next-token distribution based solely on this reasoning path. We train the policy model using GRPO with rewards that quantify how effectively the reasoning path facilitates next-word recovery. Evaluated on various benchmarks, BOW improves both the general and next-word reasoning capabilities of the base model compared with other continual pretraining baselines. Our findings show that BOW can serve as an effective and scalable alternative to vanilla NWP.
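
The bottleneck described above can be illustrated with a minimal sketch. The abstract does not give the exact reward definition, so the code below assumes, purely for illustration, that the reward is the probability the judge assigns to the gold next word. The names `toy_judge`, `reward`, and `group_relative_advantages` are hypothetical, and the toy judge is a word-counting stand-in for a frozen LLM. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the usual GRPO advantage estimate; the policy update itself is omitted.

```python
import math
from typing import Callable, Dict, List

# Hypothetical type: a judge maps a reasoning path to a next-word distribution.
Judge = Callable[[str], Dict[str, float]]


def toy_judge(reasoning_path: str) -> Dict[str, float]:
    """Stand-in for the frozen judge model (assumption, not the paper's judge).

    It sees only the reasoning path, never the original context, and scores
    each candidate word by how often the path mentions it (add-one smoothed).
    """
    vocab = ["paris", "london", "rome"]
    text = reasoning_path.lower()
    counts = [text.count(word) + 1 for word in vocab]
    total = sum(counts)
    return {word: count / total for word, count in zip(vocab, counts)}


def reward(judge: Judge, reasoning_path: str, gold_next_word: str) -> float:
    """Assumed reward: probability the judge gives to the gold next word."""
    return judge(reasoning_path).get(gold_next_word, 0.0)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# A group of reasoning paths that a policy might sample for the context
# "The capital of France is ..." (hand-written here for illustration).
paths = [
    "The capital of France is Paris.",
    "It could be London or Rome.",
    "Some European city.",
]
rewards = [reward(toy_judge, path, "paris") for path in paths]
advantages = group_relative_advantages(rewards)

for path, r, a in zip(paths, rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.3f} path={path!r}")
```

In this toy setting, the path that names the correct word receives the highest advantage and the path that points to wrong candidates receives the lowest, which is the signal a GRPO update would use to favor more informative reasoning paths.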
