Show, Don't Tell: Explainable AI-Generated Text Detection

Abstract

Research on AI-generated text detection has produced a number of approaches for distinguishing human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because detector outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score and no accompanying explanation. We tackle this issue with a novel architecture, TELL, that builds in explainability from the ground up. While our system still offers a numerical score, like other detectors, for comparability, TELL takes a fundamentally different approach: it shows the user the "tells" that lead the model to believe a text is AI- or human-written, so that the user can decide who wrote the text using their own judgment and their understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve performance competitive with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high win rate (mean 72.3%) on annotation concreteness, falsifiability, coherence, plausibility, and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.