Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
August 2, 2025
Authors: Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu
cs.AI
Abstract
Chain-of-Thought (CoT) prompting has been shown to improve Large Language
Model (LLM) performance on various tasks. With this approach, LLMs appear to
produce human-like reasoning steps before providing answers (a.k.a., CoT
reasoning), which often leads to the perception that they engage in deliberate
inferential processes. However, some initial findings suggest that CoT
reasoning may be more superficial than it appears, motivating us to explore
further. In this paper, we study CoT reasoning through a data distribution lens and
investigate whether it reflects a structured inductive bias learned from
in-distribution data, allowing the model to conditionally generate reasoning
paths that approximate those seen during training. Thus, its effectiveness is
fundamentally bounded by the degree of distribution discrepancy between the
training data and the test queries. With this lens, we dissect CoT reasoning
via three dimensions: task, length, and format. To investigate each dimension,
we design DataAlchemy, an isolated and controlled environment to train LLMs
from scratch and systematically probe them under various distribution
conditions. Our results reveal that CoT reasoning is a brittle mirage that
vanishes when pushed beyond the training distribution. This work offers a
deeper understanding of why and when CoT reasoning fails, emphasizing the
ongoing challenge of achieving genuine and generalizable reasoning.