Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
October 15, 2025
Authors: Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Pengfei Wan, Ying-Cong Chen
cs.AI
Abstract
Recent progress in large language models (LLMs) has focused on test-time
scaling, which improves reasoning by increasing inference computation, but
often at the cost of efficiency. We revisit test-time behavior and uncover a
simple yet underexplored phenomenon: reasoning uncertainty is highly localized.
Only a small subset of high-entropy tokens dominantly affects output
correctness. Motivated by this, we propose Minimal Test-Time Intervention
(MTI), a training-free framework that enhances reasoning accuracy and stability
with minimal overhead. MTI includes: (i) selective CFG intervention, which
applies classifier-free guidance (CFG) only at uncertain positions; and
(ii) lightweight negative-prompt guidance, which reuses the main model's KV
cache to approximate unconditional decoding efficiently. MTI yields consistent
gains across general, coding, and STEM tasks (e.g., +1.35% average improvement
on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using
Qwen3-32B-Reasoning) while remaining highly efficient.
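
The selective intervention described in the abstract can be illustrated with a minimal sketch. It assumes a common logit-space form of classifier-free guidance, guided = negative + scale * (conditional - negative), and gates it on the entropy of the conditional next-token distribution. The function names, the entropy threshold of 1.0, and the guidance scale of 1.5 are hypothetical choices for illustration, not values from the paper. The KV-cache reuse for the negative-prompt pass is not reproduced here; it is stood in for by a callable that is invoked only at uncertain positions.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_cfg(
    cond_logits: Sequence[float],
    neg_logits_fn: Callable[[], Sequence[float]],
    threshold: float = 1.0,  # hypothetical entropy threshold (nats)
    scale: float = 1.5,      # hypothetical guidance scale
) -> Tuple[List[float], bool]:
    """Apply CFG only when the conditional distribution is high-entropy.

    Returns the logits to sample from and a flag telling whether guidance
    was applied. `neg_logits_fn` stands in for the negative-prompt pass and
    is called only at uncertain positions, so confident tokens cost nothing
    extra.
    """
    if entropy(softmax(cond_logits)) <= threshold:
        return list(cond_logits), False
    neg_logits = neg_logits_fn()
    guided = [n + scale * (c - n) for c, n in zip(cond_logits, neg_logits)]
    return guided, True
```

In this sketch a confident position such as `[10.0, 0.0, 0.0]` passes through unchanged and never triggers the negative pass, while a flat distribution such as `[1.0, 1.0, 1.0]` (entropy ln 3, about 1.10 nats) is pushed away from the negative-prompt logits.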