Base Models Look Human to AI Detectors

Abstract

As AI-generated text enters the real world at scale, institutions increasingly rely on commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, text generated by base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP achieves a better trade-off between semantic preservation and detector evasion on commercial detectors. Across the Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently increases how human-like detectors judge the text to be. Our findings suggest that current detectors track artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.
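
The iterative step of HIP can be illustrated with a minimal sketch. The abstract does not specify the fine-tuning recipe, the number of rounds, or any stopping rule, so everything below is an assumption for illustration: `iterate_paraphrase`, its `rounds` parameter, and `toy_paraphrase` are hypothetical names, and the toy function merely stands in for a fine-tuned base model. No detector is consulted inside the loop, which is one way to read "detector-agnostic".

```python
from typing import Callable, List


def iterate_paraphrase(
    text: str,
    paraphrase: Callable[[str], str],
    rounds: int,
) -> List[str]:
    """Apply `paraphrase` repeatedly.

    Returns the original text followed by the output of every round,
    so callers can inspect how the text drifts from pass to pass.
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    versions = [text]
    for _ in range(rounds):
        # Each round rewrites the output of the previous round.
        versions.append(paraphrase(versions[-1]))
    return versions


def toy_paraphrase(text: str) -> str:
    # Hypothetical stand-in for a fine-tuned base model: it swaps a few
    # words so that successive rounds have a visible effect.
    swaps = {"use": "employ", "employ": "rely on"}
    return " ".join(swaps.get(word, word) for word in text.split())


if __name__ == "__main__":
    for i, version in enumerate(
        iterate_paraphrase("institutions use detectors", toy_paraphrase, 2)
    ):
        print(i, version)
```

Keeping every intermediate version makes it possible to measure semantic preservation and detector scores per round afterwards, which is the trade-off the abstract describes.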