ICA Lens: Interpreting Language Models Without Training an Additional Dictionary

Abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible in the activation geometry before another neural dictionary is trained? Our intuition is simple: many interpretable directions are selective for particular tokens, so projections onto them should look less Gaussian than projections onto random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability for two reasons: prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations, and they lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and improved fit diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should be viewed not as a weak baseline but as an efficient and complementary first lens for exploring language-model representations.
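The non-Gaussianity intuition can be illustrated with a small synthetic sketch. This is not the ICALens pipeline: the data are a made-up stand-in for LLM activations (Gaussian background plus one feature that fires on about 2% of tokens), and all names such as `excess_kurtosis`, `selective_dir`, and the chosen sizes are assumptions for illustration only. It compares the excess kurtosis, one common measure of non-Gaussianity, of the projection onto the token-selective direction with that of a random direction.

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis of a 1-D sample; approximately 0 for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(0)
n_tokens, dim = 20000, 64  # hypothetical sizes, chosen only for the demo

# Synthetic stand-in for activations: Gaussian background plus one
# "token-selective" feature that fires on roughly 2% of tokens.
background = rng.standard_normal((n_tokens, dim))
selective_dir = np.zeros(dim)
selective_dir[0] = 1.0
fires = rng.random(n_tokens) < 0.02
activations = background + 8.0 * fires[:, None] * selective_dir

# A random unit direction for comparison.
random_dir = rng.standard_normal(dim)
random_dir /= np.linalg.norm(random_dir)

# Project the activations onto each direction and measure non-Gaussianity.
k_selective = excess_kurtosis(activations @ selective_dir)
k_random = excess_kurtosis(activations @ random_dir)
print(f"selective: {k_selective:.2f}, random: {k_random:.2f}")
```

In this toy setting the selective direction shows a much larger excess kurtosis than the random one, which is the kind of signal an ICA contrast function is designed to seek out. Real activations are far less clean, so this should be read only as a picture of the intuition.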