Recursive Think-Answer Process for LLMs and VLMs
March 2, 2026
Authors: Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
cs.AI
Abstract
Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, a Recursive Confidence Increase Reward and a Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and refined methods for improving the reasoning processes of future AI.
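The abstract describes an iterative loop in which a confidence generator gates whether the model answers again. The sketch below illustrates that control flow only; the `generate` and `confidence` functions, the threshold, and the refinement prompt are all hypothetical stand-ins, not the paper's actual implementation.

```python
def generate(prompt):
    # Hypothetical stand-in for a think-answer model call.
    # Returns (internal reasoning, final answer).
    return "think: 2 + 2 = 4", "4"

def confidence(prompt, reasoning, answer):
    # Hypothetical stand-in for the confidence generator:
    # certainty of the answer, as a score in [0, 1].
    return 0.9

def recursive_think_answer(prompt, threshold=0.8, max_rounds=4):
    """Re-query the model until the confidence generator deems
    the answer certain enough, or a round budget is exhausted."""
    history = []
    answer = None
    for _ in range(max_rounds):
        reasoning, answer = generate(prompt)
        score = confidence(prompt, reasoning, answer)
        history.append((answer, score))
        if score >= threshold:
            break  # confident enough: stop recursing
        # Otherwise feed the previous attempt back for refinement.
        prompt = f"{prompt}\nPrevious attempt: {answer}\nRefine it."
    return answer, history
```

With the stand-in functions above, the loop exits on the first round because the mock confidence (0.9) clears the threshold; with a real model, low-confidence rounds would append refinement context and retry.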