Attacks on Machine-Text Detectors Still Preserve Stylistic Fingerprints

Abstract

Despite considerable progress in the development of machine-text detectors, the ease with which machine text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization, can effectively degrade the performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that operate in the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to evade detection. This raises the question: does style represent a universal defense against attacks on machine-text detectors? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that, unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.