Show, Don't Tell: Explainable AI-Generated Text Detection

Abstract

Research on AI-generated text detection has produced a number of approaches for distinguishing human from AI prose, some of which achieve high in-distribution performance. However, real-world adoption has stalled because detector outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score and no accompanying explanation. We tackle this issue with a novel architecture, TELL, that builds in explainability from the ground up. Our system still offers a numerical score, like other detectors, for comparability, but TELL takes a fundamentally different approach: it aims to show the user the "tells" that lead the model to believe a text is AI-generated or human-written. This empowers users to decide who wrote a text using their own judgment and their understanding of the context of the writing and its alleged author. We train TELL on a custom supervised fine-tuning (SFT) dataset of domain-specific authorship annotations, and further refine the system using Group Relative Policy Optimization (GRPO) with curriculum learning to improve performance. TELL performs competitively with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high win rate (mean 72.3%) on annotation concreteness, falsifiability, coherence, plausibility, and grounding. Explanations of this quality allow users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection from a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.
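
For readers unfamiliar with the two headline metrics, the following minimal sketch shows how an AUROC and a mean win rate over several criteria can be computed in general. It is not the evaluation code behind the reported numbers; the function names `auroc` and `mean_win_rate`, and all scores, labels, and counts in the usage lines, are hypothetical and chosen only for illustration.

```python
from typing import Mapping, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random AI text (label 1) gets a higher score
    than a random human text (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_win_rate(results: Mapping[str, Tuple[int, int]]) -> float:
    """Average of per-criterion win rates, given (wins, comparisons) pairs."""
    rates = [wins / total for wins, total in results.values()]
    return sum(rates) / len(rates)


# Toy data (made up): detector scores and ground-truth labels.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [1, 1, 0, 1, 0]
print(auroc(toy_scores, toy_labels))

# Toy pairwise-preference counts for two of the five criteria (made up).
print(mean_win_rate({"concreteness": (3, 4), "grounding": (1, 2)}))
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so this quantity depends only on how the detector orders texts, not on any particular decision threshold.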