Interleaved Reasoning for Large Language Models via Reinforcement Learning
May 26, 2025
Authors: Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra
cs.AI
Abstract
Long chain-of-thought (CoT) significantly enhances large language models'
(LLM) reasoning capabilities. However, the extensive reasoning traces lead to
inefficiencies and an increased time-to-first-token (TTFT). We propose a novel
training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs
to interleave thinking and answering for multi-hop questions. We observe that
models inherently possess the ability to perform interleaved reasoning, which
can be further enhanced through RL. We introduce a simple yet effective
rule-based reward to incentivize correct intermediate steps, which guides the
policy model toward correct reasoning paths by leveraging intermediate signals
generated during interleaved reasoning. Extensive experiments conducted across
five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++)
demonstrate consistent improvements over traditional think-answer reasoning,
without requiring external tools. Specifically, our approach reduces TTFT by
over 80% on average and improves Pass@1 accuracy by up to 19.3%. Furthermore,
our method, trained solely on question answering and logical reasoning
datasets, exhibits strong generalization ability to complex reasoning datasets
such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to
reveal several valuable insights into conditional reward modeling.
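For intuition, below is a minimal sketch of what a rule-based reward over interleaved intermediate answers could look like. It is not the paper's implementation: the answer-tag format, the exact-match rule, and the 0.2 / 1.0 weights are all assumptions made only for illustration.

```python
import re

# Hypothetical tag format; the paper's actual interleaved output format
# is not specified in the abstract.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(response: str, gold_steps: list[str], final_gold: str) -> float:
    """Toy rule-based reward: partial credit for each correct intermediate
    answer plus full credit for the final answer. Weights are illustrative."""
    answers = [a.strip() for a in ANSWER_TAG.findall(response)]
    if not answers:
        return 0.0
    intermediate, final = answers[:-1], answers[-1]
    # Exact-match credit for intermediate hops; the paper may use a different
    # matching rule or a conditional reward scheme.
    step_reward = 0.2 * sum(
        pred.lower() == gold.lower()
        for pred, gold in zip(intermediate, gold_steps)
    )
    final_reward = 1.0 if final.lower() == final_gold.lower() else 0.0
    return step_reward + final_reward
```

In an RL setup such as PPO, GRPO, or REINFORCE++, a scalar reward of this kind would be computed per rollout and used to update the policy; the abstract indicates the intermediate signals come from the interleaved answers themselves, without external tools.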