PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

Abstract

Sparse autoencoders (SAEs) have emerged as a promising method for interpreting neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: a linear model cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely from their co-occurrence. This forces SAEs to allocate monolithic features to compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with a small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of approximately 8% in probing F1 while maintaining comparable reconstruction error, and produces 2 to 10 times larger Wasserstein distances between class-conditional feature distributions. Critically, the learned interaction weights exhibit negligible correlation with co-occurrence frequency (r = 0.06 vs. r = 0.82 for SAE feature covariance), suggesting that the polynomial terms capture compositional structure, such as morphological binding and phrasal composition, largely independent of surface statistics.
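
To make the idea of a polynomial decoder concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible reading of "low-rank tensor factorization on a shared projection subspace": the sparse code is projected once through a shared matrix U, and elementwise squares and cubes of that projection are mapped to the output by C2 and C3. All names (poly_decode, W1, U, C2, C3, z_star, z_coffee) and all sizes are hypothetical. The sketch only illustrates that an additive decoder gives decode(a + b) = decode(a) + decode(b), while the higher-order terms add a contribution that appears only when two features are active together.

```python
import random


def vec_mat(v, M):
    """Return the row-vector product v @ M, where v has len(M) entries."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def poly_decode(z, W1, U, C2, C3):
    """Hypothetical polynomial decoder (an assumption, not the paper's exact form).

    z  : sparse code, length n_features
    W1 : n_features x d_model, ordinary linear SAE dictionary
    U  : n_features x rank, projection shared by both higher-order terms
    C2 : rank x d_model, output map for the second-order term
    C3 : rank x d_model, output map for the third-order term
    """
    linear = vec_mat(z, W1)
    p = vec_mat(z, U)                          # project the code into the shared subspace
    quad = vec_mat([v * v for v in p], C2)     # expands to terms in z_i * z_j
    cubic = vec_mat([v ** 3 for v in p], C3)   # expands to terms in z_i * z_j * z_k
    return [a + b + c for a, b, c in zip(linear, quad, cubic)]


def interaction(z_a, z_b, W1, U, C2, C3):
    """decode(z_a + z_b) - decode(z_a) - decode(z_b); zero for an additive decoder."""
    z_ab = [a + b for a, b in zip(z_a, z_b)]
    joint = poly_decode(z_ab, W1, U, C2, C3)
    part_a = poly_decode(z_a, W1, U, C2, C3)
    part_b = poly_decode(z_b, W1, U, C2, C3)
    return [j - (x + y) for j, x, y in zip(joint, part_a, part_b)]


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Matrix of standard-normal entries as nested lists."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration only.
n_features, d_model, rank = 6, 4, 2
W1 = rand_matrix(n_features, d_model)
U = rand_matrix(n_features, rank)
C2 = rand_matrix(rank, d_model)
C3 = rand_matrix(rank, d_model)
zeros = [[0.0] * d_model for _ in range(rank)]

# Two one-hot codes standing in for a "star" and a "coffee" feature.
z_star = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
z_coffee = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# With the higher-order maps zeroed out, the decoder is a plain linear SAE decoder.
linear_gap = interaction(z_star, z_coffee, W1, U, zeros, zeros)
poly_gap = interaction(z_star, z_coffee, W1, U, C2, C3)

print("linear decoder interaction:", linear_gap)
print("polynomial decoder interaction:", poly_gap)
```

In this parameterization the higher-order terms add n_features * rank + 2 * rank * d_model parameters on top of the linear dictionary, which is small when rank is much smaller than n_features and d_model. The rank and sizes behind the 3% figure reported above are not stated here, so the sketch does not attempt to reproduce that number.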
