Show, Don't Tell: Explainable AI-Generated Text Detection

Abstract

Research on AI-generated text detection has produced a number of approaches for distinguishing human from AI prose, some of which achieve high in-distribution performance. However, real-world adoption has stalled because the outputs of these detectors are misaligned with the needs of users, such as professors, who are presented with a numeric score and no accompanying explanation. We tackle this issue with a novel architecture, TELL, that builds in explainability from the ground up. Our system still reports a numerical score, like other detectors, for comparability, but TELL takes a fundamentally different approach: it shows the user the "tells" that lead the model to believe a text is AI-written or human-written. This empowers users to decide who wrote a text using their own judgment and their understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve performance competitive with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high win rate (mean 72.3%) on annotation concreteness, falsifiability, coherence, plausibility, and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.