Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To investigate this directly, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
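To make the gap between reconstruction quality and feature recovery concrete, the following is a minimal sketch of how the two quantities could be computed. It is not the paper's implementation: the function names, the cosine-similarity matching rule, and the 0.9 threshold are all assumptions chosen for illustration, and `random_decoder` only illustrates the general idea of fixing feature directions to random values.

```python
import numpy as np

def explained_variance(x, x_hat):
    """Fraction of the variance of x captured by the reconstruction x_hat."""
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total

def recovered_fraction(true_dirs, learned_dirs, threshold=0.9):
    """Share of ground-truth directions matched by some learned direction.

    A true direction counts as recovered when its best cosine similarity
    with any learned direction exceeds `threshold` (an assumed cutoff).
    """
    t = true_dirs / np.linalg.norm(true_dirs, axis=1, keepdims=True)
    l = learned_dirs / np.linalg.norm(learned_dirs, axis=1, keepdims=True)
    cos = t @ l.T  # shape: (n_true, n_learned)
    return float((cos.max(axis=1) > threshold).mean())

def random_decoder(n_features, d_model, seed=0):
    """Hypothetical random baseline: unit-norm Gaussian feature directions."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_features, d_model))
    return w / np.linalg.norm(w, axis=1, keepdims=True)
```

Under these definitions the two numbers can diverge: a dictionary whose directions span the right subspace can reconstruct activations well while few of its individual directions align with any ground-truth feature.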
