Self-Reflective Generation at Test Time

Abstract

Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile: an early token error can cascade through the rest of the output, which creates a clear need for self-reflection mechanisms. However, existing self-reflection methods either revise full drafts or learn self-correction through expensive training, and both approaches are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen uses dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a token-specific corrective vector that exploits the already generated context to perform self-reflective generation and correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions and thereby significantly reduces the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen consistently strengthens model reasoning, and improvements in single-pass quality also translate into stronger self-consistency voting. Notably, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
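To make the idea of dynamic entropy thresholding concrete, the following is a minimal sketch and not the SRGen implementation. The abstract does not state how the threshold is computed, so the rule used here (flag a step when its entropy exceeds the mean plus k standard deviations of a sliding window of recent entropies) is an assumption, as are the names `token_entropy`, `DynamicEntropyMonitor`, `peaked`, and the parameters `window`, `k`, and `min_history`. The corrective-vector training step is deliberately left out.

```python
import math
from collections import deque
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class DynamicEntropyMonitor:
    """Hypothetical detector for high-uncertainty generation steps.

    Assumed rule (not taken from the paper): a step is flagged when its
    entropy exceeds mean + k * std of the entropies in a sliding window.
    """

    def __init__(self, window=8, k=2.0, min_history=4):
        self.history = deque(maxlen=window)  # recent per-step entropies
        self.k = k
        self.min_history = min_history       # steps needed before flagging

    def is_uncertain(self, probs):
        h = token_entropy(probs)
        flagged = False
        if len(self.history) >= self.min_history:
            threshold = mean(self.history) + self.k * pstdev(self.history)
            flagged = h > threshold
        self.history.append(h)
        return flagged


def peaked(top, n=8):
    """Toy distribution: one token gets mass `top`, the rest share 1 - top."""
    rest = (1.0 - top) / (n - 1)
    return [top] + [rest] * (n - 1)


# Six confident steps followed by one near-uniform (high-entropy) step.
steps = [peaked(0.90 + 0.01 * i) for i in range(6)] + [[1.0 / 8] * 8]
monitor = DynamicEntropyMonitor()
flags = [monitor.is_uncertain(p) for p in steps]
print(flags)  # only the final, near-uniform step is flagged
```

In a real decoder, the probabilities would come from the softmax over the model's logits at each step, and a flagged step would be the point at which a method like SRGen intervenes before committing to a token. A relative threshold of this kind adapts to how confident the model has been recently, whereas a fixed cutoff would not.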