StyleBench: Evaluating thinking styles in Large Language Models

September 25, 2025
Authors: Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei
cs.AI

Abstract

The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, namely Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a roadmap for selecting optimal reasoning strategies based on specific constraints. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.