The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

Abstract

Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. The problem is exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data, to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To isolate memorization from the model's intrinsic problem-solving capabilities, ZCP compares the model's zero-CoT performance on the original benchmark with its zero-CoT performance on an isomorphically perturbed reference dataset. We also introduce Contamination Confidence, a metric that quantifies both the likelihood and the severity of contamination, moving beyond simple binary classification. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is available at https://github.com/Yifan-Lan/zero-cot-probe.
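The comparison at the core of ZCP can be illustrated with a minimal sketch. This is not the released implementation: `answer_fn`, `ZERO_COT_SUFFIX`, the exact-match scoring, and both function names are hypothetical stand-ins, and the abstract does not specify how the CoT is truncated or how Contamination Confidence is computed, so the sketch only reports the raw accuracy gap between the original and the perturbed items.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical item format: (question, gold_answer).
Item = Tuple[str, str]

# Assumption: reasoning is suppressed by asking for the final answer only.
# The paper's actual truncation mechanism may differ.
ZERO_COT_SUFFIX = "\nGive only the final answer, with no reasoning.\nAnswer:"


def zero_cot_accuracy(answer_fn: Callable[[str], str], items: Sequence[Item]) -> float:
    """Exact-match accuracy when the model must answer without any CoT."""
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for question, gold in items:
        reply = answer_fn(question + ZERO_COT_SUFFIX)
        if reply.strip() == gold.strip():
            correct += 1
    return correct / len(items)


def zero_cot_gap(
    answer_fn: Callable[[str], str],
    original: Sequence[Item],
    perturbed: Sequence[Item],
) -> float:
    """Zero-CoT accuracy on the original items minus that on the perturbed items.

    A large positive gap suggests the model maps original questions directly
    to memorized answers; a gap near zero suggests comparable ability on both.
    """
    return zero_cot_accuracy(answer_fn, original) - zero_cot_accuracy(answer_fn, perturbed)
```

In this sketch, `answer_fn` would wrap a black-box call to the model under test, and `perturbed` would hold structurally equivalent variants of the benchmark questions (for example, the same problem with different numbers). Turning the gap into a likelihood or severity score is left out because the abstract does not define that step.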