Base Models Look Human to AI Detectors

Abstract

As AI-generated text enters the real world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, text generated by base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across the Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector-judged human-likeness. Our findings suggest that current detectors track artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.
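
The iterative step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `iterative_paraphrase`, the `paraphrase` callable (standing in for the fine-tuned base model), and the fixed `rounds` parameter are all assumptions made for illustration. In keeping with the detector-agnostic framing, the loop never queries a detector; it only records each intermediate version so that semantic preservation and detector scores could be measured afterwards.

```python
from typing import Callable, List


def iterative_paraphrase(
    text: str,
    paraphrase: Callable[[str], str],  # hypothetical stand-in for the fine-tuned paraphraser
    rounds: int = 3,                   # assumed fixed number of passes; not from the paper
) -> List[str]:
    """Apply `paraphrase` repeatedly and return every intermediate version.

    versions[0] is the input text; versions[k] is the text after k passes.
    No detector is consulted inside the loop.
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    versions = [text]
    for _ in range(rounds):
        # Each pass rewrites the output of the previous pass.
        versions.append(paraphrase(versions[-1]))
    return versions


if __name__ == "__main__":
    # Toy paraphraser used only to show the call shape.
    def toy(s: str) -> str:
        return s.replace("utilize", "use")

    for k, version in enumerate(iterative_paraphrase("We utilize a model.", toy, rounds=2)):
        print(k, version)
```

Returning all intermediate versions, rather than only the last one, makes it possible to examine how meaning drifts as the number of passes grows, which is the trade-off the abstract refers to.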