PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding
February 1, 2026
Authors: Panagiotis Koromilas, Andreas D. Demou, James Oldfield, Yannis Panagakis, Mihalis Nicolaou
cs.AI
Abstract
Sparse autoencoders (SAEs) have emerged as a promising method for interpreting neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume that features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of approximately 8% in probing F1 while maintaining comparable reconstruction error, and produces 2-10 times larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency (r = 0.06 vs. r = 0.82 for SAE feature covariance), suggesting that polynomial terms capture compositional structure, such as morphological binding and phrasal composition, largely independent of surface statistics.
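As a reading aid, here is a minimal PyTorch sketch of the decoder structure the abstract describes: a standard linear dictionary term plus low-rank second- and third-order interaction terms computed in a shared projection subspace. The module name `PolySAEDecoder`, the parameter names (`U`, `C2`, `C3`), the exact CP-style factorization, and all shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PolySAEDecoder(nn.Module):
    """Sketch of a polynomial SAE decoder: linear reconstruction plus
    low-rank pairwise and triple interaction terms over sparse codes."""

    def __init__(self, n_features: int, d_model: int, rank: int = 64):
        super().__init__()
        # First-order term: the standard linear SAE dictionary.
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Shared low-rank projection of the sparse code, reused by all
        # higher-order terms (the "shared projection subspace").
        self.U = nn.Parameter(torch.randn(n_features, rank) * 0.01)
        # Per-order maps from the rank-r subspace back to the model space.
        self.C2 = nn.Parameter(torch.randn(rank, d_model) * 0.01)
        self.C3 = nn.Parameter(torch.randn(rank, d_model) * 0.01)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, n_features) sparse activations from the linear encoder.
        linear = z @ self.W_dec            # first-order (standard SAE) term
        p = z @ self.U                     # shared rank-r projection
        pairwise = (p * p) @ self.C2       # CP-style second-order term
        triple = (p * p * p) @ self.C3     # CP-style third-order term
        return self.b_dec + linear + pairwise + triple
```

Under these assumptions, `U` has n_features x rank entries and `C2`, `C3` have rank x d_model each, so with a small rank the added parameters remain a small fraction of the n_features x d_model dictionary, which is how a low-rank factorization can keep the overhead in the few-percent range the abstract reports.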