Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
August 2, 2025
Authors: Chengshuai Zhao, Zhen Tan, Pingchuan Ma, Dawei Li, Bohan Jiang, Yancheng Wang, Yingzhen Yang, Huan Liu
cs.AI
Abstract
Chain-of-Thought (CoT) prompting has been shown to improve Large Language
Model (LLM) performance on various tasks. With this approach, LLMs appear to
produce human-like reasoning steps before providing answers (a.k.a., CoT
reasoning), which often leads to the perception that they engage in deliberate
inferential processes. However, some initial findings suggest that CoT
reasoning may be more superficial than it appears, motivating us to explore
further. In this paper, we study CoT reasoning via a data distribution lens and
investigate if CoT reasoning reflects a structured inductive bias learned from
in-distribution data, allowing the model to conditionally generate reasoning
paths that approximate those seen during training. Thus, its effectiveness is
fundamentally bounded by the degree of distribution discrepancy between the
training data and the test queries. With this lens, we dissect CoT reasoning
via three dimensions: task, length, and format. To investigate each dimension,
we design DataAlchemy, an isolated and controlled environment to train LLMs
from scratch and systematically probe them under various distribution
conditions. Our results reveal that CoT reasoning is a brittle mirage that
vanishes when it is pushed beyond training distributions. This work offers a
deeper understanding of why and when CoT reasoning fails, emphasizing the
ongoing challenge of achieving genuine and generalizable reasoning.
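The abstract's central claim, that CoT reflects conditional generation of reasoning paths seen in training, and therefore degrades with distribution shift along task, length, and format, can be caricatured with a deliberately minimal sketch. This is hypothetical illustration code, not the paper's DataAlchemy implementation; all function names and the toy data are invented here.

```python
# Toy illustration (hypothetical, not DataAlchemy): a "model" whose only
# capability is reproducing reasoning paths whose structure appeared in its
# training distribution, probed under task, length, and format shifts.

def train(pairs):
    # Stand-in for the structured inductive bias an LLM absorbs from
    # in-distribution data: here, literal memorization of query -> path.
    return dict(pairs)

def generate(model, query):
    # Conditionally generate a reasoning path; returns None when the query
    # falls outside the training distribution.
    return model.get(query)

# In-distribution training data: two-step "add-then-double" chains.
train_pairs = [
    (("add1", "double", 3), ["3+1=4", "4*2=8"]),
    (("add1", "double", 5), ["5+1=6", "6*2=12"]),
]
model = train(train_pairs)

# Probes along the three dimensions the paper dissects.
probes = {
    "in-distribution": ("add1", "double", 3),          # seen structure
    "task shift":      ("square", "double", 3),        # unseen operation
    "length shift":    ("add1", "double", "add1", 3),  # unseen chain length
    "format shift":    ("Add1", "Double", 3),          # unseen surface form
}
for name, query in probes.items():
    print(f"{name:16s} -> {generate(model, query)}")
```

Only the in-distribution probe yields a reasoning path; every shifted probe fails, mirroring (in cartoon form) the paper's finding that CoT-style generation is bounded by the discrepancy between training data and test queries.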