Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors

Abstract

Recent advances in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about malicious uses such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging because robust benchmarks that assess generalization to real-world scenarios are lacking. In this work, we present a pipeline to test the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. To challenge the detectors, we fine-tune language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT). This exploits the detectors' reliance on stylistic cues, making new generations harder to detect. Additionally, we analyze the linguistic shifts induced by the alignment and the features that detectors rely on to identify MGT. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detection performance. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts.
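
As a rough illustration of the DPO objective mentioned above, the sketch below computes the standard DPO loss for a single preference pair in plain Python. It assumes, as one natural reading of the abstract rather than a detail stated in it, that the human-written text serves as the preferred ("chosen") sample and the machine-generated text as the dispreferred ("rejected") one. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are hypothetical choices for this example, not the authors' implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy being tuned
    and under a frozen reference model. Assumption for this sketch:
    "chosen" is the human-written text, "rejected" the machine-generated one.
    """
    # How much more likely each text is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; beta controls how far the policy may drift.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Before any update the policy equals the reference, so the margin is zero
# and the loss equals log(2).
initial = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the policy's likelihood of the chosen text lowers the loss.
shifted = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Minimizing this loss over many such pairs pushes the fine-tuned model toward the style of the chosen texts while the reference terms keep it anchored to its starting point.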
