To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Abstract

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT yields almost the same accuracy as CoT unless the question or the model's response contains an equals sign, which indicates symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
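The idea of applying CoT selectively can be illustrated with a minimal sketch. This is not the authors' method: it assumes, purely for illustration, that the presence of an equals sign in the question alone is used as a cheap signal for symbolic reasoning (the abstract also considers the model's response, which is not available before generation). The function names and prompt suffixes below are hypothetical.

```python
def contains_equals_sign(text: str) -> bool:
    """Return True if the text contains '=', used here as a rough proxy
    for symbolic operations (an assumption of this sketch)."""
    return "=" in text


def choose_prompt_style(question: str) -> str:
    """Pick 'cot' for questions that look symbolic, 'direct' otherwise."""
    return "cot" if contains_equals_sign(question) else "direct"


def build_prompt(question: str) -> str:
    """Append a hypothetical instruction matching the chosen style."""
    if choose_prompt_style(question) == "cot":
        # Spend extra tokens on step-by-step reasoning only where it may help.
        return f"{question}\nLet's think step by step."
    # Otherwise ask for the answer directly to save inference cost.
    return f"{question}\nAnswer with the final answer only."


if __name__ == "__main__":
    for q in ["Solve for x: 2x + 3 = 11",
              "Which planet is closest to the Sun?"]:
        print(choose_prompt_style(q), "|", q)
```

A router like this trades some accuracy risk for lower inference cost; whether the heuristic holds on a given dataset would need to be checked empirically.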

