Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning
May 23, 2025
Authors: Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz
cs.AI
Abstract
Reasoning large language models (LLMs) rely heavily on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While this approach demonstrates impressive results, it incurs significant computational cost and inference time. In this work, we challenge the assumption that longer thinking chains result in better reasoning capabilities. We first demonstrate that, within individual questions, shorter reasoning chains are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we propose short-m@k, a novel inference method for reasoning LLMs. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done; the final answer is chosen by majority voting among these m chains. The basic short-1@k matches or even surpasses standard majority voting in low-compute settings while using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets and remains substantially faster (up to 33% less wall time). Inspired by these results, we finetune an LLM on short, long, and randomly selected reasoning chains, and observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current approaches to test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, degrade results.
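To make the short-m@k procedure concrete, here is a minimal Python sketch, not the authors' code: it assumes a hypothetical generate_answer callable that runs one full thinking chain for a question and returns its final answer string, launches k such generations in parallel, keeps the first m that finish, and majority-votes over their answers.

```python
# Minimal sketch of short-m@k as described in the abstract (not the paper's code).
# `generate_answer` is a hypothetical stand-in for one reasoning-LLM call that
# produces a complete thinking chain and returns the final answer string.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def short_m_at_k(generate_answer, question: str, k: int = 8, m: int = 3) -> str:
    """Launch k generations in parallel, keep the first m that finish,
    and return the majority-vote answer among those m chains."""
    pool = ThreadPoolExecutor(max_workers=k)
    futures = [pool.submit(generate_answer, question) for _ in range(k)]

    first_m = []
    for future in as_completed(futures):
        first_m.append(future.result())  # answer from a chain that just finished
        if len(first_m) == m:            # stop as soon as m chains are done
            break

    # Drop the remaining k - m generations; a real serving stack would abort
    # those in-flight requests to realize the reported compute savings.
    pool.shutdown(wait=False, cancel_futures=True)

    return Counter(first_m).most_common(1)[0][0]  # majority vote over m answers
```

With m = 1 this reduces to short-1@k (take the first chain to finish); larger m trades some of the latency savings for the robustness of a small majority vote.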