FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets, either collected from the Web or generated by another model, which may contain out-of-distribution (OOD) data beyond the model's generalisation capabilities. This can result in hallucinated SAE features, which we term "Fake Features", that misrepresent the model's internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on a synthetic dataset generated by the model itself. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets yields SAEs that are more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
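
For readers unfamiliar with the setup, the sketch below shows what "decomposing a representation into features" can look like in code. It is a generic ReLU sparse autoencoder with an L1 sparsity penalty, written in plain Python; the abstract does not specify the SAE variant, so this is an assumption rather than the authors' architecture. The class name `TinySAE`, the dimensions, and the `l1_coeff` value are hypothetical and chosen only for illustration. The `seed` argument hints at why seed stability matters: different seeds give different initial weights, and may therefore lead to different learned features.

```python
import random


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class TinySAE:
    """Generic ReLU sparse autoencoder (illustrative; not the paper's exact variant)."""

    def __init__(self, d_model, d_sae, seed=0):
        # The seed controls the random initialization of the weights.
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_sae)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Feature activations: ReLU(W_enc (x - b_dec) + b_enc), always non-negative.
        centered = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.w_enc, centered), self.b_enc)]
        return [max(0.0, p) for p in pre]

    def decode(self, features):
        # Reconstruction: W_dec f + b_dec.
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        # Mean squared reconstruction error plus an L1 penalty on the features.
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        sparsity = sum(abs(v) for v in features)
        return mse + l1_coeff * sparsity


# A stand-in "activation" vector; in practice this would come from the language model.
activation = [0.5, -1.0, 0.25, 2.0]
sae = TinySAE(d_model=4, d_sae=16, seed=0)
features = sae.encode(activation)
reconstruction = sae.decode(features)
```

In the approach described above, the activation vectors fed to such an autoencoder would come from running the model on text that the model itself generated, rather than on a web-scraped corpus or another model's output.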
