Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
March 5, 2025
Authors: Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, Anton Razzhigaev, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
cs.AI
Abstract
Artificial Text Detection (ATD) is becoming increasingly important with the
rise of advanced Large Language Models (LLMs). Despite numerous efforts, no
single algorithm performs consistently well across different types of unseen
text or guarantees effective generalization to new LLMs. Interpretability plays
a crucial role in achieving this goal. In this study, we enhance ATD
interpretability by using Sparse Autoencoders (SAE) to extract features from
the Gemma-2-2b residual stream. We identify both interpretable and efficient
features, analyzing their semantics and relevance through domain- and
model-specific statistics, a steering approach, and manual or LLM-based
interpretation. Our methods offer valuable insights into how texts from various
models differ from human-written content. We show that modern LLMs have a
distinct writing style, especially in information-dense domains, even though
they can produce human-like outputs with personalized prompts.
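
To make the feature-extraction step concrete, here is a minimal sketch of how a sparse autoencoder might turn residual-stream activations into one feature vector per text. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the SAE architecture, the layer, or the pooling, so the ReLU encoder, the mean pooling over tokens, the toy dimensions, and the names `sae_encode` and `text_level_features` are all hypothetical. Random weights stand in for a trained SAE, and random activations stand in for real Gemma-2-2b outputs.

```python
import numpy as np


def sae_encode(resid, W_enc, b_enc, b_dec):
    """Map residual activations (n_tokens, d_model) to SAE features (n_tokens, d_sae).

    Assumes a plain ReLU encoder; the SAE used in the paper may differ.
    """
    return np.maximum((resid - b_dec) @ W_enc + b_enc, 0.0)


def text_level_features(resid, W_enc, b_enc, b_dec):
    """Average per-token SAE activations into a single vector for the whole text.

    Mean pooling is an assumption made for this sketch.
    """
    return sae_encode(resid, W_enc, b_enc, b_dec).mean(axis=0)


# Toy sizes chosen for illustration only; they are not Gemma-2-2b's dimensions.
rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 16, 64, 10

# Random parameters stand in for a trained SAE (assumption).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

# Random activations stand in for the residual stream of one text (assumption).
resid = rng.normal(size=(n_tokens, d_model))

features = text_level_features(resid, W_enc, b_enc, b_dec)
```

A vector like `features` could then be passed to any standard classifier to separate human-written from generated text, and individual coordinates could be inspected one at a time, which is the kind of feature-level analysis the abstract describes.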