The Illusion of Reasoning: Exposing Evasive Data Contamination in Large Language Models via Zero-CoT Truncation

Abstract

Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. The problem is exacerbated by malicious model publishers who use evasive (indirect) contamination strategies, such as paraphrasing benchmark data, to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to detect such stealthy contamination reliably. In this work, we uncover a critical phenomenon: the reasoning steps a model generates actively mask its underlying memorization. Motivated by this observation, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To isolate memorization from the model's intrinsic problem-solving capability, ZCP compares the model's zero-CoT performance on the original benchmark with its zero-CoT performance on an isomorphically perturbed reference dataset. We further introduce Contamination Confidence, a metric that quantifies both the likelihood and the severity of contamination, moving beyond simple binary classification. Extensive experiments on previously identified contaminated models and on specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is available at https://github.com/Yifan-Lan/zero-cot-probe.
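The comparison described in the abstract can be illustrated with a minimal sketch. It is not the released ZCP implementation: `answer_fn`, `zero_cot_accuracy`, and `accuracy_gap_score` are hypothetical names, and because the abstract does not give the formula for Contamination Confidence, the score below is a stand-in based on a one-sided two-proportion z-test. The sketch assumes the caller supplies a function that returns the model's final answer with the chain of thought truncated.

```python
import math
from typing import Callable, Sequence, Tuple

Item = Tuple[str, str]  # (question, gold answer)


def zero_cot_accuracy(answer_fn: Callable[[str], str], items: Sequence[Item]) -> float:
    """Fraction of items answered correctly when the model answers directly.

    `answer_fn` is a hypothetical callable that queries the model with its
    chain of thought truncated and returns only the final answer.
    """
    correct = sum(answer_fn(q).strip() == gold.strip() for q, gold in items)
    return correct / len(items)


def accuracy_gap_score(acc_orig: float, acc_ref: float,
                       n_orig: int, n_ref: int) -> Tuple[float, float]:
    """Return (gap, score) for original vs. perturbed zero-CoT accuracy.

    Assumption: the score is Phi(z) from a one-sided two-proportion z-test.
    It stands in for the paper's Contamination Confidence, whose definition
    is not stated in the abstract.
    """
    gap = acc_orig - acc_ref
    pooled = (acc_orig * n_orig + acc_ref * n_ref) / (n_orig + n_ref)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_orig + 1.0 / n_ref))
    if se == 0.0:
        # Both accuracies are 0 or both are 1, so there is no gap to test.
        return gap, 0.5
    z = gap / se
    score = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return gap, score


# Toy stand-in for a contaminated model: it has memorized the original
# items and cannot answer the perturbed ones without reasoning.
memorized = {"2+3": "5", "4+4": "8"}


def toy_model(question: str) -> str:
    return memorized.get(question, "?")


original = [("2+3", "5"), ("4+4", "8")]
reference = [("3+4", "7"), ("5+5", "10")]  # same structure, different numbers

acc_o = zero_cot_accuracy(toy_model, original)
acc_r = zero_cot_accuracy(toy_model, reference)
gap, score = accuracy_gap_score(acc_o, acc_r, len(original), len(reference))
```

A large positive gap indicates that the model can answer the original items without reasoning but cannot do the same for structurally identical perturbed items, which is the memorization signal the abstract describes. The exact-match scoring and the z-test are simplifications chosen for this sketch.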