Learning a Generative Meta-Model of LLM Activations
February 6, 2026
Authors: Grace Luo, Jiahai Feng, Trevor Darrell, Alec Radford, Jacob Steinhardt
cs.AI
Abstract
Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.