ICA Lens: Interpreting Language Models Without Training Extra Dictionaries

Abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible in activation geometry before another neural dictionary is trained? Our intuition is simple: many interpretable directions respond selectively to particular tokens, so their projections should look less Gaussian than those of random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability: prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations, and they lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and improved fit diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
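The non-Gaussianity intuition can be illustrated with a small synthetic sketch. This is not ICALens and uses no real model activations: the data generator, the 2% firing rate, the activation magnitude of 8, and all function names are assumptions chosen only for illustration. It plants one token-selective direction in Gaussian noise and compares the excess kurtosis (a standard non-Gaussianity measure, zero for a Gaussian) of projections onto that direction and onto a random direction.

```python
import numpy as np


def excess_kurtosis(x):
    """Excess kurtosis of a 1-D sample; approximately 0 for Gaussian data."""
    x = np.asarray(x, dtype=np.float64)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


def make_synthetic_activations(n_tokens=20000, dim=64, fire_rate=0.02, seed=0):
    """Gaussian noise plus one planted direction that fires on few tokens.

    Hypothetical stand-in for LLM activations: the planted direction is
    active on roughly `fire_rate` of the tokens and silent elsewhere.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_tokens, dim))
    selective = np.zeros(dim)
    selective[0] = 1.0  # planted token-selective unit direction
    fires = rng.random(n_tokens) < fire_rate
    acts = noise + 8.0 * fires[:, None] * selective
    return acts, selective


def random_unit_direction(dim, seed=1):
    """Draw a direction uniformly at random on the unit sphere."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)


acts, selective = make_synthetic_activations()
random_dir = random_unit_direction(acts.shape[1])

# Project every token's activation onto each direction and compare.
k_selective = excess_kurtosis(acts @ selective)
k_random = excess_kurtosis(acts @ random_dir)

print(f"excess kurtosis, selective direction: {k_selective:.2f}")
print(f"excess kurtosis, random direction:    {k_random:.2f}")
```

Under these assumptions the selective direction shows a heavy-tailed projection, while the random direction stays close to Gaussian. ICA-style methods search for directions that maximize this kind of non-Gaussianity contrast; real activations are far less clean than this toy setup.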