Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Abstract

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAEs) to extract features from the residual stream of Gemma-2-2b. We identify features that are both interpretable and efficient, and analyze their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
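To make the feature-extraction step concrete, the following is a minimal sketch of how a sparse autoencoder turns residual-stream activations into sparse features. It is a generic ReLU autoencoder with random toy weights, not the specific SAE or the trained weights used in the study. The function names, the dimensions, and the mean pooling over tokens are all assumptions made for illustration.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc, b_dec):
    """Map activations x of shape (tokens, d_model) to non-negative
    feature activations of shape (tokens, d_sae)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def sae_decode(f, W_dec, b_dec):
    """Reconstruct activations of shape (tokens, d_model) from features."""
    return f @ W_dec + b_dec


def text_level_features(x, W_enc, b_enc, b_dec):
    """Average feature activations over tokens: one vector per text.
    Mean pooling is an assumption, not necessarily the paper's choice."""
    return sae_encode(x, W_enc, b_enc, b_dec).mean(axis=0)


# Toy sizes and random weights (hypothetical; a real SAE is trained).
rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 16, 64, 10
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_model)

# Stand-in for residual-stream activations of one text.
x = rng.normal(size=(n_tokens, d_model))

token_feats = sae_encode(x, W_enc, b_enc, b_dec)
recon = sae_decode(token_feats, W_dec, b_dec)
feats = text_level_features(x, W_enc, b_enc, b_dec)

print(token_feats.shape, recon.shape, feats.shape)
```

In a setting like the one the abstract describes, per-text vectors such as `feats` could then be compared between human-written and model-generated texts, for example by feeding them to a simple classifier or by inspecting which features differ most across domains.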
