Interleaved Reasoning for Large Language Models via Reinforcement Learning

May 26, 2025
Authors: Roy Xie, David Qiu, Deepak Gopinath, Dong Lin, Yanchao Sun, Chong Wang, Saloni Potdar, Bhuwan Dhingra
cs.AI

Abstract

Long chain-of-thought (CoT) reasoning significantly enhances the reasoning capabilities of large language models (LLMs). However, the extensive reasoning traces lead to inefficiencies and an increased time-to-first-token (TTFT). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions. We observe that models inherently possess the ability to perform interleaved reasoning, which can be further enhanced through RL. We introduce a simple yet effective rule-based reward to incentivize correct intermediate steps, which guides the policy model toward correct reasoning paths by leveraging the intermediate signals generated during interleaved reasoning. Extensive experiments across five diverse datasets and three RL algorithms (PPO, GRPO, and REINFORCE++) demonstrate consistent improvements over traditional think-answer reasoning, without requiring external tools. Specifically, our approach reduces TTFT by over 80% on average and improves Pass@1 accuracy by up to 19.3%. Furthermore, our method, trained solely on question answering and logical reasoning datasets, exhibits strong generalization to complex reasoning benchmarks such as MATH, GPQA, and MMLU. Additionally, we conduct an in-depth analysis that reveals several valuable insights into conditional reward modeling.
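The abstract describes the rule-based reward only at a high level. For concreteness, below is a minimal Python sketch of one way such a reward could score an interleaved rollout; the `<answer>` tag format, the `rule_based_reward` helper, the 0.5 bonus weight, and the choice to condition intermediate credit on final-answer correctness are all illustrative assumptions, not the paper's specification.

```python
import re

# A minimal sketch of a rule-based reward for interleaved rollouts.
# The <answer> tag format, the 0.5 bonus weight, and the conditioning
# choice below are illustrative assumptions, not the paper's spec.
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(rollout: str, gold_steps: list[str], gold_final: str) -> float:
    """Score one rollout whose text interleaves <think> and <answer> blocks."""
    answers = [a.strip() for a in ANSWER_TAG.findall(rollout)]
    if not answers:
        return 0.0  # no answer emitted at all

    final_correct = answers[-1] == gold_final
    reward = 1.0 if final_correct else 0.0

    # Conditional intermediate credit: reward correct intermediate answers
    # only when the final answer is also correct, so partial credit cannot
    # dominate final correctness (one plausible conditioning scheme).
    if final_correct and gold_steps:
        n_correct = sum(a == g for a, g in zip(answers[:-1], gold_steps))
        reward += 0.5 * n_correct / len(gold_steps)
    return reward

# Example: a two-hop rollout with one intermediate and one final answer.
rollout = (
    "<think>Who directed Inception?</think><answer>Christopher Nolan</answer>"
    "<think>When was he born?</think><answer>1970</answer>"
)
print(rule_based_reward(rollout, gold_steps=["Christopher Nolan"], gold_final="1970"))  # 1.5
```

Gating the intermediate bonus on final correctness is one simple way to exploit intermediate signals without rewarding rollouts that get partial steps right but the final answer wrong; the paper's analysis of conditional reward modeling suggests such choices matter, though its exact scheme is not given in the abstract.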

