StyleBench: Evaluating thinking styles in Large Language Models

Abstract

The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel at open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve large efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a roadmap for selecting a reasoning strategy under specific constraints. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.
