Can sparse autoencoders be used to decompose and interpret steering vectors?

Abstract

Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
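
The second reason can be illustrated with a minimal sketch. It assumes a toy SAE with a ReLU encoder, orthonormal feature directions, tied encoder and decoder weights, and zero biases; the names `sae_encode`, `sae_decode`, `W_enc`, `W_dec`, and `b_enc` are hypothetical and are not taken from the paper. It only shows that a ReLU encoder cannot report a negative projection onto a feature direction, so that component drops out of the reconstruction.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # ReLU encoder: feature activations are non-negative by construction.
    return np.maximum(0.0, W_enc @ x + b_enc)

def sae_decode(f, W_dec):
    # Linear decoder: weighted sum of feature directions (columns of W_dec).
    return W_dec @ f

# Toy setup (assumption): 4 orthonormal feature directions, tied weights, no bias.
d_model = 4
W_dec = np.eye(d_model)
W_enc = W_dec.T
b_enc = np.zeros(d_model)

# A hypothetical steering vector that adds feature 0 and subtracts feature 1.
steering = 2.0 * W_dec[:, 0] - 1.5 * W_dec[:, 1]

projections = W_dec.T @ steering          # signed projections onto each direction
features = sae_encode(steering, W_enc, b_enc)
reconstruction = sae_decode(features, W_dec)

print("projections:   ", projections)
print("SAE features:  ", features)
print("reconstruction:", reconstruction)
```

In this toy case the projection onto feature 1 is negative, yet the encoder assigns it zero activation, so the reconstruction keeps only the feature 0 component. Real SAEs have learned, non-orthogonal directions and non-zero biases, so the effect there would be less clean than in this sketch.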