Interleaved Reasoning for Large Language Models via Reinforcement Learning
May 26, 2025
Authors: Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra
cs.AI
Abstract
Long chain-of-thought (CoT) significantly enhances large language models'
(LLM) reasoning capabilities. However, the extensive reasoning traces lead to
inefficiencies and an increased time-to-first-token (TTFT). We propose a novel
training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs
to interleave thinking and answering for multi-hop questions. We observe that
models inherently possess the ability to perform interleaved reasoning, which
can be further enhanced through RL. We introduce a simple yet effective
rule-based reward to incentivize correct intermediate steps, which guides the
policy model toward correct reasoning paths by leveraging intermediate signals
generated during interleaved reasoning. Extensive experiments conducted across
five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++)
demonstrate consistent improvements over traditional think-answer reasoning,
without requiring external tools. Specifically, our approach reduces TTFT by
over 80% on average and improves Pass@1 accuracy by up to 19.3%. Furthermore,
our method, trained solely on question answering and logical reasoning
datasets, exhibits strong generalization ability to complex reasoning datasets
such as MATH, GPQA, and MMLU. Additionally, we conduct in-depth analysis to
reveal several valuable insights into conditional reward modeling.
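For intuition, below is a minimal sketch of what a rule-based reward over interleaved intermediate answers could look like. It is not the paper's implementation: the answer-tag format, the exact-match rule, and the 0.2 / 1.0 weights are all assumptions made only for illustration.

```python
import re

# Hypothetical tag format; the paper's actual interleaved output format
# is not specified in the abstract.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(response: str, gold_steps: list[str], final_gold: str) -> float:
    """Toy rule-based reward: partial credit for each correct intermediate
    answer plus full credit for the final answer. Weights are illustrative."""
    answers = [a.strip() for a in ANSWER_TAG.findall(response)]
    if not answers:
        return 0.0
    intermediate, final = answers[:-1], answers[-1]
    # Exact-match credit for intermediate hops; the paper may use a different
    # matching rule or a conditional reward scheme.
    step_reward = 0.2 * sum(
        pred.lower() == gold.lower()
        for pred, gold in zip(intermediate, gold_steps)
    )
    final_reward = 1.0 if final.lower() == final_gold.lower() else 0.0
    return step_reward + final_reward
```

In an RL setup such as PPO, GRPO, or REINFORCE++, a scalar reward of this kind would be computed per rollout and used to update the policy; the abstract indicates the intermediate signals come from the interleaved answers themselves, without external tools.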