Detection Avoidance Techniques for Large Language Models

Abstract

The increasing popularity of large language models has led not only to widespread use but also to various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in a series of experiments. Systematically changing the generative models' temperature proved shallow-learning detectors to be the least reliable. Fine-tuning the generative model via reinforcement learning circumvented BERT-based detectors. Finally, rephrasing led to more than 90% evasion of zero-shot detectors such as DetectGPT, although the texts stayed highly similar to the original. A comparison with existing work highlights the better performance of the presented methods. Possible implications for society and further research are discussed.
