

Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

December 24, 2025
Authors: Abhranil Chandra, Ayush Agrawal, Arian Hosseini, Sebastian Fischmeister, Rishabh Agarwal, Navin Goyal, Aaron Courville
cs.AI

Abstract

We present the surprising finding that a language model's reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon: first, the distribution of synthetic data is inherently closer to the language model's own distribution, making it more amenable to learning; second, these 'incorrect' traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To further test the first hypothesis, we use a language model to paraphrase human-annotated traces, shifting their distribution closer to the model's own, and show that this improves performance. For the second hypothesis, we introduce increasingly flawed CoT traces and study the extent to which models tolerate these flaws. We demonstrate our findings across reasoning domains such as math, algorithmic reasoning, and code generation, using the MATH, GSM8K, Countdown, and MBPP datasets on language models ranging from 1.5B to 9B parameters across the Qwen, Llama, and Gemma families. Our study shows that curating datasets closer to the model's distribution is a critical aspect to consider. We also show that a correct final answer is not always a reliable indicator of a faithful reasoning process.
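
To make the distribution-shift hypothesis concrete, below is a minimal sketch (not the authors' implementation) of the paraphrasing setup the abstract describes: a language model rewrites human-annotated CoT traces so that their distribution moves closer to the student model's own, and the rewritten traces are then used as supervised fine-tuning data. The checkpoint name, prompt template, sample problem, and the deferred fine-tuning step are all illustrative assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "Qwen/Qwen2.5-1.5B-Instruct"  # assumed student checkpoint (illustrative)

tok = AutoTokenizer.from_pretrained(STUDENT)
lm = AutoModelForCausalLM.from_pretrained(STUDENT)

def paraphrase_trace(question, human_cot):
    # Ask the model to restate a human-written solution in its own words,
    # preserving the reasoning steps and the final answer.
    prompt = (
        f"Question: {question}\n"
        f"Reference solution:\n{human_cot}\n\n"
        "Rewrite this solution step by step in your own words, keeping the same "
        "reasoning and final answer:\n"
    )
    inputs = tok(prompt, return_tensors="pt")
    out = lm.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Illustrative GSM8K-style example; a real run would iterate over the full dataset.
human_annotated = [
    ("Natalia sold 48 clips in April and half as many in May. How many clips did she sell in total?",
     "In May she sold 48 / 2 = 24 clips. In total she sold 48 + 24 = 72 clips. The answer is 72."),
]

# Paraphrased traces form the distribution-shifted training set; the student would
# then be fine-tuned on them with any standard supervised fine-tuning loop (omitted here).
sft_data = [
    {"prompt": q, "completion": paraphrase_trace(q, cot)}
    for q, cot in human_annotated
]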