
Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

October 15, 2025
Authors: Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Pengfei Wan, Ying-Cong Chen
cs.AI

Abstract

Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized; only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning) while remaining highly efficient.
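To make the selective-intervention idea concrete, below is a minimal sketch of entropy-gated classifier-free guidance at a single decoding step. The standard CFG combination, the entropy threshold `tau`, and the guidance weight `w` are illustrative assumptions rather than the paper's exact formulation, and the sketch does not show the KV-cache reuse that makes the paper's negative-prompt guidance lightweight.

```python
# Minimal sketch of entropy-gated classifier-free guidance (CFG) at decode time.
# Assumptions (illustrative, not taken from the paper): the standard CFG rule
#   guided = uncond + w * (cond - uncond),
# an entropy gate `tau`, and a guidance weight `w`.

import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution given logits [vocab]."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def selective_cfg_step(
    cond_logits: torch.Tensor,    # logits from the normal (conditional) forward pass
    uncond_logits: torch.Tensor,  # logits from the negative-prompt / unconditional pass
    tau: float = 3.0,             # entropy gate: only uncertain positions get guidance
    w: float = 1.5,               # guidance strength
) -> torch.Tensor:
    """Apply CFG only when the model is uncertain; otherwise leave logits unchanged."""
    if token_entropy(cond_logits).item() > tau:
        return uncond_logits + w * (cond_logits - uncond_logits)
    return cond_logits


if __name__ == "__main__":
    vocab = 32000
    cond = torch.randn(vocab)
    uncond = torch.randn(vocab)
    guided = selective_cfg_step(cond, uncond)
    next_token = torch.argmax(guided).item()
    print(f"entropy={token_entropy(cond).item():.2f}, next_token={next_token}")
```

Because guidance is applied only when the entropy of the conditional distribution exceeds the gate, the second (unconditional) pass is needed at only a small fraction of positions, which is where the claimed efficiency comes from.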