StyleBench: Evaluating thinking styles in Large Language Models

Abstract

The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, namely Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel on open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a roadmap for selecting optimal reasoning strategies based on specific constraints. The benchmark is open source at https://github.com/JamesJunyuGuo/Style_Bench.
