Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Abstract

Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providing answers (known as CoT reasoning), which often leads to the perception that they engage in deliberate inferential processes. However, some initial findings suggest that CoT reasoning may be more superficial than it appears, motivating further exploration. In this paper, we study CoT reasoning through a data distribution lens and investigate whether CoT reasoning reflects a structured inductive bias learned from in-distribution data, allowing the model to conditionally generate reasoning paths that approximate those seen during training. Under this view, its effectiveness is fundamentally bounded by the degree of distribution discrepancy between the training data and the test queries. With this lens, we dissect CoT reasoning along three dimensions: task, length, and format. To investigate each dimension, we design DataAlchemy, an isolated and controlled environment in which we train LLMs from scratch and systematically probe them under various distribution conditions. Our results reveal that CoT reasoning is a brittle mirage that vanishes when it is pushed beyond training distributions. This work offers a deeper understanding of why and when CoT reasoning fails, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
