Recursive Think-Answer Process for LLMs and VLMs

Abstract

Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that goes beyond conventional single-pass approaches by enabling models to engage in iterative reasoning cycles and generate more accurate answers. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that models trained with R-TAP exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way toward efficient and refined methods for improving the reasoning processes of future AI.
