The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

Abstract

Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. The problem is exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data, to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To further separate memorization from the model's intrinsic problem-solving ability, ZCP compares the model's zero-CoT performance on the original benchmark with its zero-CoT performance on an isomorphically perturbed reference dataset. We also introduce Contamination Confidence, a metric that quantifies both the likelihood and the severity of contamination, moving beyond simple binary classification. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is available at https://github.com/Yifan-Lan/zero-cot-probe.
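The comparison at the core of ZCP can be illustrated with a minimal sketch. It is not the released implementation. The prompt template, the function names, and the use of a one-sided two-proportion z-test are assumptions made here for illustration. The abstract does not define Contamination Confidence, so the p-value below is only a generic stand-in for "how surprising is the accuracy gap", not the paper's metric.

```python
import math
from typing import Sequence


def zero_cot_prompt(question: str) -> str:
    """Hypothetical template: ask for the final answer with no reasoning steps."""
    return (
        f"{question}\n"
        "Give only the final answer. Do not show any reasoning.\n"
        "Answer:"
    )


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    return sum(correct) / len(correct)


def gap_p_value(orig: Sequence[bool], ref: Sequence[bool]) -> float:
    """One-sided two-proportion z-test (an assumed stand-in statistic).

    orig: per-item zero-CoT correctness on the original benchmark.
    ref:  per-item zero-CoT correctness on the perturbed reference set.
    Returns the p-value for "accuracy on orig is higher than on ref".
    A small value means the gap is unlikely under equal accuracy.
    """
    n1, n2 = len(orig), len(ref)
    p1, p2 = accuracy(orig), accuracy(ref)
    pooled = (sum(orig) + sum(ref)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # All answers correct or all wrong on both sets: no gap to test.
        return 1.0
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Toy correctness vectors, invented purely for illustration.
    original = [True] * 80 + [False] * 20
    reference = [True] * 50 + [False] * 50
    print("gap:", accuracy(original) - accuracy(reference))
    print("p-value:", gap_p_value(original, reference))
```

A model that answers the original items correctly without any reasoning, but fails on structurally equivalent perturbed items, produces a large gap and a small p-value in this sketch, which is the kind of signal the abstract attributes to memorized shortcut mappings.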