Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs

Abstract

Large language models (LLMs) such as OpenAI's o1 have demonstrated remarkable abilities in complex reasoning tasks by scaling test-time compute and exhibiting human-like deep thinking. However, we identify a phenomenon we term underthinking, in which o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. This behavior leads to inadequate depth of reasoning and decreased performance, particularly on challenging mathematical problems. To systematically analyze this issue, we conduct experiments on three challenging test sets and two representative open-source o1-like models, revealing that frequent thought switching correlates with incorrect responses. We introduce a novel metric that quantifies underthinking by measuring token efficiency in incorrect answers. To address underthinking, we propose a decoding strategy with a thought switching penalty (TIP) that discourages premature transitions between thoughts and encourages deeper exploration of each reasoning path. Experimental results demonstrate that our approach improves accuracy across challenging datasets without requiring model fine-tuning. Our findings contribute to understanding reasoning inefficiencies in o1-like LLMs and offer a practical solution to enhance their problem-solving capabilities.
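To make the idea of a thought switching penalty concrete, the following is a minimal sketch of one way such a penalty could be applied at decoding time. The abstract does not specify the mechanism, so everything here is an assumption for illustration: the function names `penalize_thought_switch` and `greedy_pick`, the notion of a fixed set of "switch" token IDs, the `penalty` strength, and the `min_thought_length` window are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def penalize_thought_switch(
    logits: Dict[int, float],
    switch_token_ids: Set[int],
    tokens_in_current_thought: int,
    penalty: float = 3.0,            # assumed penalty strength (hypothetical value)
    min_thought_length: int = 600,   # assumed window, in tokens (hypothetical value)
) -> Dict[int, float]:
    """Return a copy of `logits` with switch tokens down-weighted.

    The penalty is applied only while the current thought is shorter than
    `min_thought_length`, so early switches are discouraged but not forbidden.
    """
    if tokens_in_current_thought >= min_thought_length:
        return dict(logits)
    return {
        token_id: (score - penalty if token_id in switch_token_ids else score)
        for token_id, score in logits.items()
    }


def greedy_pick(logits: Dict[int, float]) -> int:
    """Pick the token ID with the highest logit."""
    return max(logits, key=logits.get)


# Toy vocabulary: token 1 stands in for a thought-switching word.
toy_logits = {0: 2.0, 1: 2.5, 2: 0.1}
switch_ids = {1}

early = penalize_thought_switch(toy_logits, switch_ids, tokens_in_current_thought=10)
late = penalize_thought_switch(toy_logits, switch_ids, tokens_in_current_thought=600)

print(greedy_pick(toy_logits))  # unpenalized choice
print(greedy_pick(early))       # choice while the thought is still short
print(greedy_pick(late))        # choice once the thought is long enough
```

Because the adjustment acts only on the logits at each decoding step, a scheme of this kind leaves the model weights untouched, which is consistent with the abstract's claim that no fine-tuning is required.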
