Base Models Look Human to AI Detectors

Abstract

As AI-generated text enters the real world at scale, institutions increasingly rely on commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, text generated by base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across the Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently increases how human-like the detectors judge the text to be. Our findings suggest that current detectors track artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.