Attacks on Machine-Text Detectors Still Preserve Stylistic Fingerprints

Abstract

Despite considerable progress in the development of machine-text detectors, the ease with which machine text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization, can effectively degrade the performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors operating in a stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to avoid detection. This raises the question: does style represent a universal defense against attacks on machine-text detection? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that, unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.
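The intuition behind the multi-document result can be illustrated with a toy simulation. The sketch below is not the detector or the data used in this work: it assumes, purely for illustration, that a hypothetical detector assigns each document a noisy score, and that machine documents score only slightly higher on average than human ones (the `shift` parameter and the Gaussian score model are assumptions). Under that assumption, averaging the scores of more documents makes the two sources easier to tell apart.

```python
import random
import statistics


def mean_score(scores):
    """Aggregate per-document detector scores into one score for the whole set."""
    return statistics.fmean(scores)


def separation_accuracy(n_docs, trials=2000, shift=0.2, seed=0):
    """Estimate how often a set of n_docs machine documents receives a higher
    aggregate score than a set of n_docs human documents.

    Per-document scores are simulated as unit-variance Gaussians whose means
    differ by `shift`. This is a toy assumption, not a result from the paper.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        human = [rng.gauss(0.0, 1.0) for _ in range(n_docs)]
        machine = [rng.gauss(shift, 1.0) for _ in range(n_docs)]
        if mean_score(machine) > mean_score(human):
            wins += 1
    return wins / trials


if __name__ == "__main__":
    # Accuracy near 0.5 means the two sources are indistinguishable;
    # it should rise as more documents are pooled.
    for n in (1, 10, 100):
        print(n, round(separation_accuracy(n), 3))
```

With a single document the simulated comparison is close to chance, while pooling many documents pushes it well above chance, because averaging shrinks the noise while the small mean difference is preserved. How large the per-document difference is for a real attack is an empirical question that this sketch does not address.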