
Self-Reflective Generation at Test Time

October 3, 2025
Authors: Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu
cs.AI

Abstract

Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile: early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both of which are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen uses dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a token-specific corrective vector that fully exploits the already-generated context for self-reflective generation, correcting the token's probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions and thereby significantly reduces the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen consistently strengthens model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. In particular, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
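A minimal sketch of the decoding loop the abstract describes, assuming a Hugging Face-style causal LM with an `lm_head`. The dynamic threshold (running mean plus a multiple of the standard deviation of observed token entropies) and the corrective-vector objective (a few gradient steps on a bias added to the final hidden state, minimizing the entropy of the corrected next-token distribution) are illustrative assumptions; the paper's exact threshold schedule, parameterization, and loss may differ.

```python
# Illustrative sketch of SRGen-style self-reflective decoding; not the authors' code.
import torch
import torch.nn.functional as F


def token_entropy(logits):
    """Shannon entropy (nats) of the next-token distribution; differentiable."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)


def train_corrective_vector(lm_head, hidden, steps=5, lr=1e-2):
    """Fit a per-token corrective vector on the final hidden state.

    Assumed objective: minimize the entropy of the corrected next-token
    distribution, so the model re-commits to a confident choice using only
    the already-generated context encoded in `hidden`.
    """
    delta = torch.zeros_like(hidden, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = token_entropy(lm_head(hidden + delta)).mean()
        loss.backward()
        opt.step()
    return lm_head(hidden + delta).detach()


@torch.no_grad()
def generate_srgen(model, tokenizer, prompt, max_new_tokens=256, k=1.5, warmup=8):
    """Greedy decoding that reflects before emitting high-uncertainty tokens."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    history = []  # entropies of tokens generated so far
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        hidden = out.hidden_states[-1][:, -1]  # final-layer state, last position
        logits = out.logits[:, -1]
        h = token_entropy(logits).item()
        # Dynamic threshold: mean + k * std of past entropies (assumed form).
        if len(history) >= warmup:
            mu = sum(history) / len(history)
            sd = (sum((e - mu) ** 2 for e in history) / len(history)) ** 0.5
            if h > mu + k * sd:  # high-uncertainty token: reflect before generating
                with torch.enable_grad():
                    logits = train_corrective_vector(model.lm_head, hidden)
        history.append(h)
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

The design point carried over from the abstract is that reflection happens before the uncertain token is emitted, using only the already-generated context, rather than revising a full draft after the fact.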