

Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

May 30, 2025
Authors: Andrea Pedrotti, Michele Papucci, Cristiano Ciaccio, Alessio Miaschi, Giovanni Puccetti, Felice Dell'Orletta, Andrea Esuli
cs.AI

Abstract

Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic cues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and the features detectors rely on to identify MGT. Our results show that detectors can be fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
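The pipeline the abstract describes has two moving parts: aligning a generator toward HWT style with DPO, and scoring the resulting text with off-the-shelf MGT detectors. Below is a minimal sketch of both steps using the Hugging Face trl and transformers libraries; the checkpoints, preference pairs, and hyperparameters are illustrative placeholders, not the authors' actual setup, and keyword names can vary across trl versions.

```python
# Sketch 1: DPO fine-tuning that prefers human-written text (HWT) over
# machine-generated text (MGT) for the same prompt, nudging the model's
# style toward HWT. Checkpoint and data below are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "gpt2"  # stand-in; the paper's generators may differ
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Hypothetical preference pairs: "chosen" = HWT, "rejected" = MGT.
pairs = Dataset.from_dict({
    "prompt":   ["Write a short paragraph about the election results."],
    "chosen":   ["<human-written continuation>"],
    "rejected": ["<machine-generated continuation>"],
})

args = DPOConfig(
    output_dir="dpo-style-shift",
    beta=0.1,                      # strength of the preference signal
    per_device_train_batch_size=1,
    num_train_epochs=1,
)
trainer = DPOTrainer(
    model=model,                   # reference model is cloned internally
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,    # `tokenizer=` in older trl versions
)
trainer.train()

# Sketch 2: scoring texts with an MGT detector exposed as a sequence
# classifier. The checkpoint below is a public stand-in, not necessarily
# the Mage / Radar / LLM-DetectAIve checkpoints evaluated in the paper.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)
texts = ["<candidate text>", "<DPO-shifted text>"]
for text, pred in zip(texts, detector(texts)):
    print(f"{pred['label']} ({pred['score']:.3f}): {text}")
```

Comparing detector scores on generations before and after the DPO alignment is the essence of the stress test: a large drop in detector confidence on the shifted text indicates the adversarial attack succeeded.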