A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning
July 18, 2025
Authors: Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, Manling Li
cs.AI
Abstract
Multi-turn problem solving is critical yet challenging for Large Reasoning
Models (LRMs): it requires them to reflect on their reasoning and revise it
based on feedback. Existing
Reinforcement Learning (RL) methods train large reasoning models on a
single-turn paradigm with verifiable rewards. However, we observe that models
trained with existing RL paradigms often lose their ability to solve problems
across multiple turns and struggle to revise answers based on contextual
feedback, leading to repetitive responses. We ask: can LRMs learn to reflect on
their answers in a multi-turn context? In this work, we find that training
models with multi-turn RL using only unary feedback (e.g., "Let's try again")
after wrong answers can improve both single-turn performance and multi-turn
reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement
learning, which uses minimal yet common unary user feedback during iterative
problem solving. It can be easily applied to existing single-turn RL training
setups. Experimental results show that RL training with UFO preserves single-turn
performance and improves multi-turn reasoning accuracy by up to 14%, enabling
language models to better react to feedback in multi-turn problem solving. To
further minimize the number of turns needed for a correct answer while
encouraging diverse reasoning when mistakes occur, we design reward structures
that guide models to produce careful and deliberate answers in each turn. Code:
https://github.com/lichengliu03/unary-feedback
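
To make the setup concrete, below is a minimal sketch of one UFO-style multi-turn rollout, assuming a generic `generate(messages)` model call and a `check_answer` verifier (both names are illustrative, not the paper's actual API). After each wrong answer the model only observes the fixed string "Let's try again.", and, as one possible instance of the turn-minimizing reward structure described above, a correct answer earns a reward discounted by the number of turns taken. The authors' actual training code is in the linked repository.

```python
# Minimal sketch of one UFO-style multi-turn episode.
# `generate` and `check_answer` are assumed interfaces for illustration;
# the reward constants are hypothetical, not the authors' values.

UNARY_FEEDBACK = "Let's try again."  # the only feedback the model ever sees
MAX_TURNS = 5                        # turn budget per problem
TURN_PENALTY = 0.1                   # small penalty per extra turn taken


def rollout(problem, generate, check_answer):
    """Run one episode; return (conversation, scalar reward)."""
    messages = [{"role": "user", "content": problem}]
    for turn in range(MAX_TURNS):
        answer = generate(messages)  # model proposes an answer for this turn
        messages.append({"role": "assistant", "content": answer})

        if check_answer(problem, answer):
            # Correct: reward shrinks with each extra turn, encouraging
            # careful, deliberate answers early on.
            return messages, 1.0 - TURN_PENALTY * turn

        # Wrong: append only minimal unary feedback as the next observation,
        # with no hint about why the answer was wrong.
        messages.append({"role": "user", "content": UNARY_FEEDBACK})

    return messages, 0.0  # unsolved within the turn budget
```

Trajectories and rewards collected this way could then be handed to an existing RL training loop, which would be consistent with the abstract's claim that UFO applies easily to single-turn RL setups.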