StyleBench: Evaluating Thinking Styles in Large Language Models

Abstract

The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, namely Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel on open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.
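As an illustration only, the evaluation grid described in the abstract (every model scored under every reasoning style on every task) might be organized roughly as follows. The prompt wording, the `ask` callable, the exact-match scoring, and the toy task are all assumptions made for this sketch; none of them is taken from the StyleBench repository.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# The five style names come from the abstract.
# The instruction text for each style is hypothetical.
STYLE_PROMPTS: Dict[str, str] = {
    "CoT": "Think step by step, then give the final answer.",
    "ToT": "Explore several reasoning branches, then give the final answer.",
    "AoT": "Search over candidate solutions algorithmically, then give the final answer.",
    "SoT": "Sketch the reasoning in a few terse notes, then give the final answer.",
    "CoD": "Write a minimal draft of each step, then give the final answer.",
}

Example = Tuple[str, str]          # (question, gold answer)
AskFn = Callable[[str, str], str]  # (model name, prompt) -> model output


def accuracy_grid(
    models: List[str],
    tasks: Dict[str, List[Example]],
    ask: AskFn,
) -> Dict[Tuple[str, str, str], float]:
    """Return exact-match accuracy for each (model, style, task) cell.

    Assumes every task has at least one example.
    """
    grid: Dict[Tuple[str, str, str], float] = {}
    for model, style, task in product(models, STYLE_PROMPTS, tasks):
        examples = tasks[task]
        correct = sum(
            ask(model, f"{STYLE_PROMPTS[style]}\n\n{question}").strip() == gold
            for question, gold in examples
        )
        grid[(model, style, task)] = correct / len(examples)
    return grid


# Hypothetical stand-in for a real model call: it only "knows" one answer.
def fake_ask(model: str, prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "?"


toy_tasks = {"arith": [("What is 2 + 2?", "4"), ("What is 3 + 5?", "8")]}
grid = accuracy_grid(["toy-model"], toy_tasks, fake_ask)
for cell, acc in sorted(grid.items()):
    print(cell, acc)
```

With a real `ask` function in place of `fake_ask`, comparing cells along the style axis for a fixed model and task is one way to check which style suits a given constraint. Exact-match scoring is a simplification chosen here for brevity.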
