Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

May 30, 2025
作者: Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, Felice Dell'Orletta, Andrea Esuli
cs.AI

Abstract

Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic cues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and the features detectors rely on to identify MGT. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
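The core move described above — aligning MGT style toward HWT via DPO — depends on preference pairs in which the human-written text is the "chosen" response and the model's own generation is "rejected". A minimal sketch of how such pairs might be assembled (the `build_dpo_pairs` helper and the sample strings are illustrative assumptions, not the authors' released pipeline; the field names follow the prompt/chosen/rejected convention used by libraries such as Hugging Face TRL's `DPOTrainer`):

```python
# Sketch: assemble DPO preference pairs that push a model's writing style
# toward human-written text (HWT). Keys follow the prompt/chosen/rejected
# convention expected by TRL's DPOTrainer; the pairing logic is an
# assumption for illustration, not the paper's code.

def build_dpo_pairs(prompts, human_texts, machine_texts):
    """For each prompt, prefer the human continuation over the machine one."""
    if not (len(prompts) == len(human_texts) == len(machine_texts)):
        raise ValueError("prompts, human_texts, and machine_texts must align")
    return [
        {"prompt": p, "chosen": hwt, "rejected": mgt}
        for p, hwt, mgt in zip(prompts, human_texts, machine_texts)
    ]

# Hypothetical example data: one prompt with a human and a machine continuation.
pairs = build_dpo_pairs(
    prompts=["Write a short news lead about a local election."],
    human_texts=["Voters turned out in record numbers on Tuesday..."],
    machine_texts=["In the realm of local politics, it is worth noting..."],
)
print(pairs[0]["chosen"])  # the human-written text is the preferred response
```

A list of such dictionaries can be wrapped in a `datasets.Dataset` and handed to a DPO trainer; optimization then raises the relative likelihood of the "chosen" (human-styled) responses, which is what degrades detectors that key on machine-typical stylistic cues.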