

Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

May 23, 2025
Authors: Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz
cs.AI

Abstract
Reasoning large language models (LLMs) rely heavily on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While this approach demonstrates impressive results, it incurs significant computational costs and inference time. In this work, we challenge the assumption that longer thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we suggest short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen by majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance over standard majority voting in low-compute settings - using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets, while still being substantially faster (up to 33% wall-time reduction). Inspired by our results, we finetune an LLM using short, long, and randomly selected reasoning chains. We then observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.
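The short-m@k selection rule described above can be sketched in a few lines. This is an illustrative simulation, not the paper's implementation: in a real serving setup the k generations decode concurrently and computation halts once the first m finish; here, that race is approximated by taking the m chains with the fewest thinking tokens. The function name and the (thinking_tokens, answer) pair representation are assumptions for this sketch.

```python
from collections import Counter

def short_m_at_k(chains, m):
    """Sketch of short-m@k: majority vote over the m shortest of k chains.

    `chains` is a list of k (thinking_tokens, answer) pairs from parallel
    generations. Sorting by thinking tokens simulates stopping decoding as
    soon as the first m thinking processes complete.
    """
    first_m = sorted(chains, key=lambda c: c[0])[:m]
    votes = Counter(answer for _, answer in first_m)
    return votes.most_common(1)[0][0]

# Example: k = 5 sampled chains as (thinking_tokens, answer)
chains = [(1200, "A"), (800, "B"), (950, "B"), (2100, "A"), (700, "B")]
print(short_m_at_k(chains, m=3))  # votes among the 3 shortest chains
```

With m=1 this reduces to taking the single shortest chain (short-1@k); larger m trades some of the token savings for the robustness of majority voting.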

