

Recursive Think-Answer Process for LLMs and VLMs

March 2, 2026
Authors: Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
cs.AI

Abstract

Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and elaborate methods for refining the reasoning processes of future AI.
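The recursive think-answer loop and the two rewards described above can be sketched as follows. This is a minimal, hypothetical illustration: the toy "model", the toy confidence generator, the 0.95 stopping threshold, and the exact reward forms are our assumptions for illustration, not the paper's actual implementation.

```python
def generate(prompt, prev_answer=None):
    """Toy stand-in for a model pass: refines a numeric guess toward a
    fixed target. A real R-TAP loop would re-run the LLM/VLM here."""
    target = 42
    if prev_answer is None:
        return 30
    return prev_answer + (target - prev_answer) // 2


def confidence(answer, target=42):
    """Toy stand-in for the confidence generator: certainty grows as
    the answer approaches the target."""
    return max(0.0, 1.0 - abs(target - answer) / target)


def r_tap(prompt, max_rounds=5, threshold=0.95):
    """Recursive think-answer loop: keep re-generating, guided by the
    confidence generator, until the response is confident enough."""
    answer = generate(prompt)
    confidences = [confidence(answer)]
    for _ in range(max_rounds - 1):
        if confidences[-1] >= threshold:
            break  # confident enough; stop recursing
        answer = generate(prompt, answer)
        confidences.append(confidence(answer))
    # Two illustrative training signals (assumed forms):
    # (1) reward confidence increases across successive rounds,
    # (2) reward the confidence of the final answer.
    increase_reward = sum(
        max(0.0, b - a) for a, b in zip(confidences, confidences[1:])
    )
    final_reward = confidences[-1]
    return answer, increase_reward + final_reward
```

With the toy model above, the loop converges in four rounds: the guess moves 30 → 36 → 39 → 40, confidence rises monotonically, and the loop exits once it crosses the threshold.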