Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Abstract

Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (so-called CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating us to explore further. In this paper, we study CoT reasoning through a data distribution lens and investigate whether CoT reasoning reflects a structured inductive bias learned from in-distribution data, which allows the model to conditionally generate reasoning paths that approximate those seen during training. If so, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning along three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment in which we train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
