Don't Overthink it. Preferring Shorter Thinking Chains for Improved LLM Reasoning

Abstract

Reasoning large language models (LLMs) rely heavily on scaling test-time compute to perform complex reasoning tasks by generating extensive "thinking" chains. While this approach demonstrates impressive results, it incurs significant computational cost and inference time. In this work, we challenge the assumption that long thinking chains result in better reasoning capabilities. We first demonstrate that shorter reasoning chains within individual questions are significantly more likely to yield correct answers: up to 34.5% more accurate than the longest chain sampled for the same question. Based on these results, we propose short-m@k, a novel reasoning LLM inference method. Our method executes k independent generations in parallel and halts computation once the first m thinking processes are done. The final answer is chosen by majority voting among these m chains. Basic short-1@k demonstrates similar or even superior performance to standard majority voting in low-compute settings, while using up to 40% fewer thinking tokens. short-3@k, while slightly less efficient than short-1@k, consistently surpasses majority voting across all compute budgets and is still substantially faster (up to 33% wall-clock time reduction). Inspired by these results, we finetune an LLM using short, long, and randomly selected reasoning chains, and observe that training on the shorter ones leads to better performance. Our findings suggest rethinking current methods of test-time compute in reasoning LLMs, emphasizing that longer "thinking" does not necessarily translate to improved performance and can, counter-intuitively, lead to degraded results.
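The selection rule behind short-m@k can be illustrated with a minimal Python sketch. This is not the authors' implementation: it works offline on chains that have already been sampled, and it assumes that the chains finishing first in a parallel run are the ones with the fewest thinking tokens. The names `Chain`, `short_m_at_k`, and `majority_at_k` are hypothetical, the sample data is made up for illustration, and the tie-breaking rule (prefer the answer from the earliest-finishing chain) is an assumption, since the abstract does not specify one.

```python
from collections import Counter
from typing import NamedTuple


class Chain(NamedTuple):
    thinking_tokens: int  # thinking-chain length, used here as a proxy for finish time
    answer: str           # final answer extracted from the generation


def short_m_at_k(chains: list[Chain], m: int) -> str:
    """Majority vote over the m chains that finish first (here: the m shortest)."""
    if not 1 <= m <= len(chains):
        raise ValueError("m must be between 1 and k = len(chains)")
    # sorted() is stable, so chains of equal length keep their sampling order.
    first_m = sorted(chains, key=lambda c: c.thinking_tokens)[:m]
    votes = Counter(c.answer for c in first_m)
    top = max(votes.values())
    # Assumed tie-break: among the most-voted answers, return the one
    # produced by the earliest-finishing chain.
    return next(c.answer for c in first_m if votes[c.answer] == top)


def majority_at_k(chains: list[Chain]) -> str:
    """Standard majority voting over all k chains, for comparison."""
    return Counter(c.answer for c in chains).most_common(1)[0][0]


# Made-up sample: k = 5 generations for a single question.
sample = [
    Chain(900, "12"),
    Chain(300, "7"),
    Chain(1500, "12"),
    Chain(450, "7"),
    Chain(500, "12"),
]
answer_short_1 = short_m_at_k(sample, 1)  # uses only the shortest chain
answer_short_3 = short_m_at_k(sample, 3)  # votes over the three shortest chains
answer_majority = majority_at_k(sample)   # votes over all five chains
```

In an actual parallel deployment the remaining k - m generations would be cancelled as soon as the m-th thinking process completes, which is where the reported savings in thinking tokens and wall-clock time would come from; the sketch only reproduces the answer-selection step.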
