Attacks on Machine-Text Detectors Preserve Stylistic Fingerprints

Abstract

Despite considerable progress in the development of machine-text detectors, the ease with which machine text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization, can effectively degrade the performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that utilize the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to prevent detection. This raises the question: does style represent a universal defense against attacks on machine-text detection? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that, unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.