Interleaved Reasoning for Large Language Models via Reinforcement Learning

Abstract

Long chain-of-thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs). However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward that incentivizes correct intermediate steps and guides the policy model toward correct reasoning paths by leveraging the intermediate signals generated during interleaved reasoning. Extensive experiments across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves Pass@1 accuracy by up to 19.3%. Furthermore, our method, trained solely on question answering and logical reasoning datasets, generalizes strongly to complex reasoning datasets such as MATH, GPQA, and MMLU. Additionally, we conduct an in-depth analysis that reveals several valuable insights into conditional reward modeling.
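As a rough illustration of what a rule-based reward over intermediate steps might look like, the sketch below scores an interleaved response. Everything specific here is an assumption made for illustration and is not taken from the abstract: the `<think>`/`<answer>` tag format, the exact-match comparison, the `step_bonus` weight, and the choice to grant intermediate credit only when the final answer is correct (one possible reading of "conditional" reward).

```python
import re
from typing import List

# Assumed output format: alternating <think>...</think> and <answer>...</answer> segments.
SEGMENT_RE = re.compile(r"<(think|answer)>(.*?)</\1>", re.DOTALL)


def extract_answers(response: str) -> List[str]:
    """Return the text of every <answer> segment, in order of appearance."""
    return [body.strip() for tag, body in SEGMENT_RE.findall(response) if tag == "answer"]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so exact-match comparison is less brittle."""
    return " ".join(text.lower().split())


def interleaved_reward(
    response: str,
    gold_intermediate: List[str],
    gold_final: str,
    step_bonus: float = 0.5,  # hypothetical weight for intermediate credit
) -> float:
    """Rule-based reward: 1.0 for a correct final answer, plus a bonus
    proportional to the fraction of gold intermediate answers that appear,
    granted only when the final answer is correct (assumed condition)."""
    answers = extract_answers(response)
    if not answers:
        return 0.0  # no parsable answer segment

    final_correct = normalize(answers[-1]) == normalize(gold_final)
    reward = 1.0 if final_correct else 0.0

    if final_correct and gold_intermediate:
        produced = {normalize(a) for a in answers[:-1]}
        hits = sum(1 for g in gold_intermediate if normalize(g) in produced)
        reward += step_bonus * hits / len(gold_intermediate)
    return reward


if __name__ == "__main__":
    sample = (
        "<think>Find the capital first.</think><answer>Paris</answer>"
        "<think>Now find the country.</think><answer>France</answer>"
    )
    print(interleaved_reward(sample, ["Paris"], "France"))
```

In this sketch, the first `<answer>` segment is what a user would see early, which is the intuition behind the TTFT reduction; the reward only shapes whether those early answers are correct.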
