Can sparse autoencoders be used to decompose and interpret steering vectors?
November 13, 2024
Authors: Harry Mayne, Yushi Yang, Adam Mahdi
cs.AI
Abstract
Steering vectors are a promising approach to control the behaviour of large
language models. However, their underlying mechanisms remain poorly understood.
While sparse autoencoders (SAEs) may offer a potential method to interpret
steering vectors, recent findings show that SAE-reconstructed vectors often
lack the steering properties of the original vectors. This paper investigates
why directly applying SAEs to steering vectors yields misleading
decompositions, identifying two reasons: (1) steering vectors fall outside the
input distribution for which SAEs are designed, and (2) steering vectors can
have meaningful negative projections in feature directions, which SAEs are not
designed to accommodate. These limitations hinder the direct use of SAEs for
interpreting steering vectors.
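
The second reason can be illustrated with a toy example. The sketch below is not the paper's experimental setup: it assumes a hypothetical three-feature SAE with identity encoder and decoder weights and zero biases, and it assumes the steering vector is built as a difference of two activations (a common construction, not stated in the abstract). It only shows that a ReLU encoder zeroes any negative projection, so the reconstruction drops that component.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc, b_dec):
    """Standard ReLU SAE encoder: feature activations are clipped at zero."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Linear decoder: weighted sum of feature directions plus a bias."""
    return f @ W_dec + b_dec

# Hypothetical toy SAE (assumption): 3 orthonormal feature directions, no biases.
d = 3
W_enc, W_dec = np.eye(d), np.eye(d)
b_enc, b_dec = np.zeros(d), np.zeros(d)

# Assumed construction: steering vector as a difference of two activations.
act_pos = np.array([2.0, 0.0, 1.0])   # feature 0 active
act_neg = np.array([0.0, 2.0, 1.0])   # feature 1 active
steering = act_pos - act_neg          # [2, -2, 0]

# Signed projections of the steering vector onto each feature direction.
projections = W_dec @ steering

# What the SAE reports: the negative projection on feature 1 is clipped to 0.
features = sae_encode(steering, W_enc, b_enc, b_dec)
reconstruction = sae_decode(features, W_dec, b_dec)

print("projections:   ", projections)
print("SAE features:  ", features)
print("reconstruction:", reconstruction)
```

In this toy case the steering vector pushes away from feature 1 as strongly as it pushes towards feature 0, yet the SAE decomposition reports only feature 0. The first reason (steering vectors lying outside the SAE's input distribution) is not modelled here.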