Explainable Disentangled Representation Learning for Generalizable Authorship Attribution in the Era of Generative AI

Abstract

Learning robust representations of authorial style is crucial for authorship attribution and AI-generated text detection. However, existing methods often struggle with content-style entanglement: models learn spurious correlations between an author's writing style and the topics that author writes about, which leads to poor generalization across domains. To address this challenge, we propose the Explainable Authorship Variational Autoencoder (EAVAE), a novel framework that explicitly disentangles style from content through architectural separation-by-design. EAVAE first pretrains style encoders using supervised contrastive learning on diverse authorship data, then finetunes them within a Variational Autoencoder (VAE) architecture that uses separate encoders for style and content representations. Disentanglement is enforced by a novel discriminator that not only determines whether a pair of style or content representations comes from the same or different authors or content sources, but also generates a natural language explanation for its decision, which mitigates confounding information and improves interpretability at the same time. Extensive experiments demonstrate the effectiveness of EAVAE. On authorship attribution, it achieves state-of-the-art performance on several datasets, including Amazon Reviews, PAN21, and HRS. On AI-generated text detection, EAVAE excels in few-shot learning on the M4 dataset. Code and data repositories are available online at https://github.com/hieum98/avae and https://huggingface.co/collections/Hieuman/document-level-authorship-datasets
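The abstract names supervised contrastive learning as the pretraining objective for the style encoder but does not spell out the loss. The sketch below illustrates one common formulation of a supervised contrastive loss, in which texts by the same author are pulled together and texts by different authors are pushed apart. It is an illustration under stated assumptions, not the authors' implementation: the function name `supervised_contrastive_loss`, the `temperature` default, and the toy 2-D embeddings are all hypothetical.

```python
import math


def supervised_contrastive_loss(embeddings, author_ids, temperature=0.1):
    """Mean supervised contrastive loss over anchors with a same-author positive.

    Assumes every embedding is a non-zero vector (hypothetical helper, not EAVAE code).
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    losses = []
    for i in range(n):
        # Temperature-scaled similarity of anchor i to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in logits if author_ids[j] == author_ids[i]]
        if not positives:
            continue  # this anchor has no same-author sample in the batch
        # Log-sum-exp over all non-anchor samples, stabilized by the max logit.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability assigned to the positives.
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two authors, two texts each (embeddings are made up for illustration).
authors = ["a", "a", "b", "b"]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # grouped by author
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]      # authors interleaved

loss_clustered = supervised_contrastive_loss(clustered, authors)
loss_mixed = supervised_contrastive_loss(mixed, authors)
print(loss_clustered < loss_mixed)
```

In this toy batch the loss is lower when embeddings cluster by author than when same-author texts sit far apart, which is the behavior a style encoder would be trained toward under this kind of objective.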
