Show, Don't TELL: Explainable AI-Generated Text Detection
May 27, 2026
Authors: Aldan Creo, Suraj Ranganath
cs.AI
Abstract
Research on AI-generated text detection has produced a number of approaches for distinguishing human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score, like other detectors, for comparability, TELL takes a fundamentally different approach: we aim to show the user the "tells" by which the model believes a text is AI- or human-written, empowering the user to decide who wrote a text using their own judgment and their understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve performance competitive with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high win rate (mean 72.3%) on annotation concreteness, falsifiability, coherence, plausibility, and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.
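For readers less familiar with the headline metric, the following is a minimal sketch of how an AUROC such as the reported 0.927 can be computed from a detector's numeric scores. It is not the authors' evaluation code: the function name `auroc`, the label convention (1 for AI-written, 0 for human-written), and the toy scores are all assumptions made for illustration. It uses the standard pairwise definition, in which AUROC is the probability that a randomly chosen AI-written text receives a higher score than a randomly chosen human-written one.

```python
def auroc(scores, labels):
    """Pairwise AUROC: chance that a random positive outscores a random negative.

    Assumed convention (not from the paper): label 1 = AI-written, 0 = human-written.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up detector scores for illustration only (not data from the paper).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(auroc(scores, labels))
```

The quadratic pairwise loop is chosen for readability; on large evaluation sets, a rank-based computation or a library routine would be the more practical choice. An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.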