ICA Lens: Interpreting Language Models Without Training Another Dictionary

Abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible in activation geometry before another neural dictionary is trained? Our intuition is simple: many interpretable directions are selective for particular tokens, so their activations should look less Gaussian than those of random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability: prior work often relied on off-the-shelf ICA implementations, which are brittle on LLM activations, and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and improved fit diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should be viewed not as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
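The non-Gaussianity intuition above can be illustrated with a small, self-contained sketch. This is not the ICALens pipeline: it is a plain NumPy version of textbook symmetric FastICA with a tanh nonlinearity, run on synthetic data rather than real LLM activations. The sparse "token-selective" sources, the 5% activity rate, the dimensions, and all function names (`excess_kurtosis`, `whiten`, `sym_decorrelate`, `fast_ica`) are assumptions made for illustration only.

```python
import numpy as np


def excess_kurtosis(x):
    """Excess kurtosis of each row; close to 0 for Gaussian data."""
    x = x - x.mean(axis=1, keepdims=True)
    var = (x ** 2).mean(axis=1)
    return (x ** 4).mean(axis=1) / var ** 2 - 3.0


def whiten(x):
    """Center and whiten x (features x samples) via an eigendecomposition."""
    x = x - x.mean(axis=1, keepdims=True)
    cov = x @ x.T / x.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # V diag(1/sqrt(lambda)) V^T applied to the centered data
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T @ x


def sym_decorrelate(w):
    """Project w onto the nearest orthogonal matrix, (w w^T)^(-1/2) w."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt


def fast_ica(z, n_iter=200, tol=1e-6, seed=0):
    """Symmetric FastICA (tanh nonlinearity) on whitened data z."""
    rng = np.random.default_rng(seed)
    d, n = z.shape
    w = sym_decorrelate(rng.standard_normal((d, d)))
    for _ in range(n_iter):
        y = np.tanh(w @ z)
        g_prime = 1.0 - y ** 2
        # Fixed-point update: E[g(w z) z^T] - E[g'(w z)] w, row by row
        w_new = y @ z.T / n - g_prime.mean(axis=1, keepdims=True) * w
        w_new = sym_decorrelate(w_new)
        # Converged when every row is unchanged up to sign
        delta = np.max(np.abs(np.abs(np.sum(w_new * w, axis=1)) - 1.0))
        w = w_new
        if delta < tol:
            break
    return w


# Synthetic stand-in for activations (an assumption, not real model data):
# each source is active on roughly 5% of "tokens", then linearly mixed.
rng = np.random.default_rng(0)
d, n = 16, 20000
sources = rng.standard_normal((d, n)) * (rng.random((d, n)) < 0.05)
mixing = rng.standard_normal((d, d))
acts = mixing @ sources

z = whiten(acts)
w = fast_ica(z)
recovered = w @ z

# Compare non-Gaussianity of ICA directions with random unit directions.
rand_dirs = rng.standard_normal((100, d))
rand_dirs /= np.linalg.norm(rand_dirs, axis=1, keepdims=True)
kurt_ica = excess_kurtosis(recovered)
kurt_rand = excess_kurtosis(rand_dirs @ z)

# Best absolute correlation of any recovered component with each true source.
corr = np.abs(np.corrcoef(recovered, sources)[:d, d:])
match = corr.max(axis=0)

print("median excess kurtosis, ICA directions:   ", np.median(kurt_ica))
print("median excess kurtosis, random directions:", np.median(kurt_rand))
print("worst-case source match (abs. correlation):", match.min())
```

In this toy setting the ICA directions should show markedly higher excess kurtosis than random projections of the same data, which is the property the abstract relies on. Real activations are higher-dimensional and less well behaved, which is where the stability recipes and diagnostics mentioned above would matter.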