Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

Abstract

Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about malicious uses such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic cues, making new generations more challenging to detect. Additionally, we analyze the linguistic shifts induced by the alignment and the features that detectors use to identify MGT. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
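
To make the DPO step mentioned above concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair. It assumes, purely for illustration, that the human-written text is treated as the preferred ("chosen") completion and the machine-generated text as the "rejected" one for the same prompt. The function name, the value of `beta`, and the log-probability inputs are hypothetical and are not taken from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    Assumption for this sketch: "chosen" is a human-written text (HWT) and
    "rejected" is a machine-generated text (MGT) for the same prompt.
    """
    # How much more (in log-space) the policy favors each text than the
    # frozen reference model does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for either sign of margin.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss equals log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy shifted toward the chosen text: the loss drops below log(2).
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Minimizing this quantity over many pairs pushes the fine-tuned model to assign relatively more probability to the chosen style than the reference model does, which is the mechanism by which generations would drift toward the preferred writing style.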
