Attacks on Machine-Text Detectors Preserve Stylistic Fingerprints

Abstract

Despite considerable progress in the development of machine-text detectors, the ease with which machine text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization, can effectively degrade the performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that utilize the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to prevent detection. This raises the question: does style represent a universal defense against attacks on machine-text detection? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that, unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.