

Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

May 23, 2025
作者: Michael Hassid, Gabriel Synnaeve, Yossi Adi, Roy Schwartz
cs.AI

Abstract

Reasoning large language models (LLMs) heavily rely on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While demonstrating impressive results, this approach incurs significant computational costs and inference time. In this work, we challenge the assumption that longer thinking chains result in better reasoning capabilities. We first demonstrate that, within individual questions, shorter reasoning chains are significantly more likely to yield correct answers - up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we propose short-m@k, a novel reasoning-LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes finish; the final answer is chosen by majority voting among these m chains. The basic short-1@k matches or exceeds standard majority voting in low-compute settings while using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets while still being substantially faster (up to 33% wall-time reduction). Inspired by these results, we finetune an LLM on short, long, and randomly selected reasoning chains and observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current test-time compute methods in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, degrade results.
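The selection rule behind short-m@k can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: it assumes all k generations have already completed and represents each as a hypothetical (thinking-token count, answer) pair, using chain length as a stand-in for finishing order (in a real deployment the k generations run in parallel and the remaining ones are cancelled as soon as m finish).

```python
from collections import Counter

def short_m_at_k(generations, m):
    """Sketch of short-m@k selection.

    `generations` is a list of (num_thinking_tokens, answer) pairs for
    k sampled chains; the m shortest chains stand in for the m that
    would finish first, and their answers are majority-voted.
    """
    # Chains with fewer thinking tokens are the ones that finish first.
    first_m = sorted(generations, key=lambda g: g[0])[:m]
    votes = Counter(answer for _, answer in first_m)
    # most_common is built on a stable sort in CPython, so vote ties
    # tend to resolve toward the earlier-counted (shorter) chain.
    return votes.most_common(1)[0][0]

# Hypothetical example: 5 sampled chains with thinking-token counts.
samples = [(1200, "A"), (450, "B"), (800, "B"), (2000, "C"), (600, "A")]
print(short_m_at_k(samples, m=3))  # shortest 3 are B, A, B -> prints "B"
```

With m=1 this reduces to short-1@k (just take the first chain to finish), which is why it needs no voting at all and saves the most compute.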
