
StyleBench: Evaluating thinking styles in Large Language Models

September 25, 2025
Authors: Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei
cs.AI

Abstract

The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, namely Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a roadmap for selecting optimal reasoning strategies based on specific constraints. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.
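
The claim that no single style is universally optimal amounts to comparing accuracy per (style, task) pair and choosing a style per task. The following is a minimal sketch of that comparison, not the StyleBench code: the function names, the record format, and the task labels `task_a` and `task_b` are assumptions made for illustration, and the toy records are placeholders rather than results from the paper.

```python
from collections import defaultdict

# The five reasoning styles named in the abstract.
STYLES = ["CoT", "ToT", "AoT", "SoT", "CoD"]


def accuracy_by_style_and_task(records):
    """Return {(style, task): accuracy} from (style, task, is_correct) records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for style, task, is_correct in records:
        totals[(style, task)] += 1
        hits[(style, task)] += int(is_correct)
    return {key: hits[key] / totals[key] for key in totals}


def best_style_per_task(accuracy):
    """Pick the highest-accuracy style per task (ties keep the style seen first)."""
    best = {}
    for (style, task), acc in accuracy.items():
        if task not in best or acc > best[task][1]:
            best[task] = (style, acc)
    return best


# Hypothetical placeholder records, NOT results from the paper.
toy_records = [
    ("CoT", "task_a", True), ("CoT", "task_a", False),
    ("CoD", "task_a", True), ("CoD", "task_a", True),
    ("CoT", "task_b", True), ("CoT", "task_b", True),
    ("CoD", "task_b", False), ("CoD", "task_b", True),
]

acc = accuracy_by_style_and_task(toy_records)
best = best_style_per_task(acc)
```

In this toy data the preferred style differs between the two tasks, which is the kind of pattern the abstract describes. A fuller comparison would also group by model size and track token cost, since the abstract ties both to the choice of style.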