
A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

July 18, 2025
Authors: Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li
cs.AI

Abstract

Multi-turn problem solving, in which a model reflects on its reasoning and revises its answers from feedback, is critical yet challenging for Large Reasoning Models (LRMs). Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving and can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO preserves single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to react better to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
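To make the UFO idea concrete, below is a minimal Python sketch of what a multi-turn rollout with unary feedback might look like. It is an illustration under stated assumptions, not the authors' implementation: `generate`, `is_correct`, and the reward constants (`correct_reward`, `turn_penalty`, `repeat_penalty`) are hypothetical placeholders. The turn penalty and the penalty for repeating an earlier wrong answer are assumed shapings in the spirit of the reward design the abstract describes.

```python
def ufo_rollout(generate, is_correct, question, max_turns=5,
                correct_reward=1.0, turn_penalty=0.1, repeat_penalty=0.2):
    """Roll out up to `max_turns` attempts on one question, appending
    only unary feedback ("Let's try again.") after each wrong answer."""
    history = [{"role": "user", "content": question}]
    seen_wrong = set()   # wrong answers produced so far
    trajectory = []      # (history snapshot, reward) per turn
    for turn in range(max_turns):
        answer = generate(history)  # model's attempt this turn
        history.append({"role": "assistant", "content": answer})
        if is_correct(answer):
            # Reward the correct answer, discounted by the turns used,
            # so fewer turns yields a higher return.
            trajectory.append((list(history), correct_reward - turn_penalty * turn))
            break
        # Mildly penalize repeating an earlier wrong answer, nudging the
        # model toward diverse reasoning across retries (assumed shaping).
        reward = -repeat_penalty if answer in seen_wrong else 0.0
        seen_wrong.add(answer)
        trajectory.append((list(history), reward))
        # Unary feedback: no hint about the error, just a request to retry.
        history.append({"role": "user", "content": "Let's try again."})
    return trajectory

# Toy usage with stub callables (for illustration only):
attempts = iter(["7", "42"])
traj = ufo_rollout(generate=lambda h: next(attempts),
                   is_correct=lambda a: a == "42",
                   question="What is 6 * 7?")
```

Because the feedback carries no information about *why* an answer is wrong, any rollout collected this way plugs into a standard single-turn RL training loop with the multi-turn history simply treated as the observation, which is what makes the setup easy to retrofit.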