Self-Reflective Generation at Test Time
October 3, 2025
Authors: Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu
cs.AI
Abstract
Large language models (LLMs) increasingly solve complex reasoning tasks via
long chain-of-thought, but their forward-only autoregressive generation process
is fragile; early token errors can cascade, which creates a clear need for
self-reflection mechanisms. However, existing self-reflection either performs
revisions over full drafts or learns self-correction via expensive training,
both of which are fundamentally reactive and inefficient. To address this, we propose
Self-Reflective Generation at Test Time (SRGen), a lightweight test-time
framework that reflects before generating at uncertain points. During token
generation, SRGen utilizes dynamic entropy thresholding to identify
high-uncertainty tokens. For each identified token, it trains a token-specific
corrective vector that fully exploits the already generated context in a
self-reflective generation step to correct the token probability distribution. By
retrospectively analyzing the partial output, this self-reflection enables more
trustworthy decisions, thereby significantly reducing the probability of errors
at highly uncertain points. Evaluated on challenging mathematical reasoning
benchmarks and a diverse set of LLMs, SRGen can consistently strengthen model
reasoning: improvements in single-pass quality also translate into stronger
self-consistency voting. In particular, on AIME2024 with
DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on
Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a
plug-and-play method that integrates reflection into the generation process for
reliable LLM reasoning, achieving consistent gains with bounded overhead and
broad composability with other training-time (e.g., RLHF) and test-time (e.g.,
SLOT) techniques.
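
For a concrete picture of the mechanism described above, the following is a minimal sketch of how an entropy-gated corrective vector could be wired into greedy decoding with a Hugging Face-style causal LM (one that exposes `lm_head` and `output_hidden_states`). The threshold rule (running mean plus k standard deviations of recent entropies), the entropy-minimization objective, and injecting the vector at the final hidden state are all illustrative assumptions; the abstract does not specify SRGen's exact formulation.

```python
# Sketch of entropy-gated, per-token corrective-vector decoding.
# Assumptions (not from the paper): the gate is mean + k*std of past entropies,
# the corrective objective is entropy minimization, and the vector is added to
# the final hidden state before the LM head.
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (nats) of the next-token distribution."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def train_corrective_vector(model, input_ids, steps: int = 5, lr: float = 0.1):
    """Optimize a small vector added to the last hidden state so the corrected
    next-token distribution becomes more confident (lower entropy)."""
    with torch.no_grad():
        out = model(input_ids, output_hidden_states=True)
        hidden = out.hidden_states[-1][:, -1, :]           # already-generated context
    delta = torch.zeros_like(hidden, requires_grad=True)   # corrective vector
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model.lm_head(hidden + delta)
        loss = token_entropy(logits).mean()                # sharpen the distribution
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model.lm_head(hidden + delta)               # corrected logits


def generate_with_reflection(model, tokenizer, prompt, max_new_tokens=256, k=1.5):
    """Greedy decoding with an entropy gate: reflect (train a corrective vector)
    whenever the current entropy exceeds mean + k * std of entropies seen so far,
    a simple stand-in for the paper's dynamic entropy thresholding."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    history = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits[:, -1, :]
        h = token_entropy(logits).item()
        if len(history) >= 8:
            hist = torch.tensor(history)
            if h > (hist.mean() + k * hist.std()).item():
                logits = train_corrective_vector(model, input_ids)  # reflect first
        history.append(h)
        next_id = logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

In this sketch the only trained parameters are the per-position vector `delta`, which keeps each reflection step lightweight relative to any form of fine-tuning, in the spirit of the bounded-overhead claim above.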
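The reported metrics can be read as follows, assuming Pass@1 denotes single-sample accuracy and Cons@5 denotes majority ("self-consistency") voting over five sampled solutions per problem; the sketch below illustrates that reading.

```python
# Sketch of the evaluation metrics named in the abstract, under the assumption
# that Pass@1 is single-sample accuracy and Cons@k is majority-vote accuracy.
from collections import Counter


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems whose single prediction matches the reference answer."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def cons_at_k(sampled: list[list[str]], references: list[str]) -> float:
    """Fraction of problems whose majority answer over k samples matches the reference."""
    correct = 0
    for answers, ref in zip(sampled, references):
        majority, _ = Counter(answers).most_common(1)[0]
        correct += majority == ref
    return correct / len(references)


# Example: 2 problems, 5 sampled answers each.
refs = ["42", "7"]
samples = [["42", "41", "42", "42", "13"], ["7", "7", "9", "7", "7"]]
print(cons_at_k(samples, refs))  # 1.0
```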