Base Models Look Human to AI Detectors

Abstract

As AI-generated text enters the real world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, text generated by base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across the Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector human-likeness. Our findings suggest that current detectors track artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.
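
The iterative step of HIP can be illustrated with a minimal sketch. It assumes only that a paraphraser is available as a function from text to text; the names `iterative_paraphrase`, `toy_paraphrase`, and the `rounds` parameter are hypothetical and do not come from the paper. The loop never queries a detector, which is one way to read "detector-agnostic", and it keeps every intermediate version so that semantic preservation can be checked per round.

```python
from typing import Callable, List


def iterative_paraphrase(
    text: str,
    paraphrase: Callable[[str], str],
    rounds: int = 3,
) -> List[str]:
    """Apply `paraphrase` repeatedly and return every intermediate version.

    Index 0 is the input text; index k is the output after k rounds.
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    versions = [text]
    for _ in range(rounds):
        # Each round paraphrases the previous round's output.
        versions.append(paraphrase(versions[-1]))
    return versions


def toy_paraphrase(s: str) -> str:
    # Stand-in only (assumption): a real setup would call the
    # fine-tuned base model here instead of doing string replacement.
    return s.replace("utilize", "use").replace("  ", " ")


if __name__ == "__main__":
    for i, v in enumerate(iterative_paraphrase("We utilize  models", toy_paraphrase, rounds=2)):
        print(i, v)
```

Returning the full list rather than only the final text makes it straightforward to compare each round against the original when studying the trade-off between semantic preservation and detector human-likeness.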