Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Abstract

Recent progress in large language models (LLMs) has focused on test-time scaling, which improves reasoning by increasing inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, and only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, which applies classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, which reuses the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning) while remaining highly efficient.
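As a rough illustration of the selective intervention idea, the sketch below gates classifier-free guidance on the entropy of the next-token distribution. It is not the authors' implementation: the abstract does not specify the entropy threshold, the guidance formula, or how the negative-prompt logits are produced. The sketch assumes the commonly used CFG combination (negative-branch logits plus a scaled difference), and the names `selective_cfg`, `tau`, and `scale` are hypothetical. KV-cache reuse is not modeled here; `neg_logits` is simply taken as given.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_cfg(cond_logits, neg_logits, tau=1.0, scale=1.5):
    """Apply CFG only when the conditional distribution is uncertain.

    tau (entropy threshold) and scale (guidance weight) are illustrative
    values chosen for this sketch, not values taken from the paper.
    Returns (logits, intervened).
    """
    h = entropy(softmax(cond_logits))
    if h <= tau:
        # Confident position: leave the model's logits untouched.
        return list(cond_logits), False
    # Uncertain position: standard CFG combination (an assumption here).
    guided = [n + scale * (c - n) for c, n in zip(cond_logits, neg_logits)]
    return guided, True


# Toy 4-token vocabulary: one sharply peaked step and one nearly flat step.
confident = [8.0, 0.0, 0.0, 0.0]
uncertain = [1.0, 0.9, 0.8, 0.7]
negative = [0.0, 0.9, 0.8, 0.7]

_, touched_confident = selective_cfg(confident, negative)
guided, touched_uncertain = selective_cfg(uncertain, negative)
```

In this toy setup only the nearly flat step crosses the threshold, so guidance is paid for at one position out of two. That matches the abstract's claim that uncertainty is concentrated in a small subset of tokens.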
