ICA Lens: Interpreting Language Models Without Additional Dictionary Training

Abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible in activation geometry before another neural dictionary is trained? Our intuition is simple: many interpretable directions are token-selective, responding strongly to only a few tokens, so they should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability for two reasons: prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations, and systematic tools for inspecting and evaluating the recovered directions were lacking. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and improved fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should be viewed not as a weak baseline but as an efficient and complementary first lens for exploring language-model representations.
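The non-Gaussianity intuition can be illustrated with a small toy sketch. It is not the ICALens pipeline: the activations are synthetic, all names (`synthetic_activations`, `project`, `excess_kurtosis`) and parameters (firing probability, feature magnitude, dimension) are hypothetical, and excess kurtosis is assumed here as one common measure of non-Gaussianity, not necessarily the contrast function used in this work. A feature that fires on only a few tokens is planted along one axis of a Gaussian background, and the projection onto that axis is compared with the projection onto a random direction.

```python
import math
import random


def excess_kurtosis(xs):
    """Excess kurtosis: near 0 for Gaussian samples, positive for heavy-tailed ones."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


def synthetic_activations(n_tokens, dim, fire_prob, rng):
    """Gaussian background plus a feature on axis 0 that fires on a few tokens.

    Hypothetical toy data, not real LLM activations.
    """
    rows = []
    for _ in range(n_tokens):
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if rng.random() < fire_prob:
            row[0] += 8.0  # assumed feature magnitude
        rows.append(row)
    return rows


def project(rows, direction):
    """Project every activation vector onto the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [sum(a * u for a, u in zip(row, unit)) for row in rows]


rng = random.Random(0)
dim = 16
acts = synthetic_activations(20000, dim, 0.02, rng)

selective_dir = [1.0] + [0.0] * (dim - 1)          # the planted feature axis
random_dir = [rng.gauss(0.0, 1.0) for _ in range(dim)]

k_selective = excess_kurtosis(project(acts, selective_dir))
k_random = excess_kurtosis(project(acts, random_dir))

print(f"token-selective direction: excess kurtosis = {k_selective:.2f}")
print(f"random direction:          excess kurtosis = {k_random:.2f}")
```

In this toy setting the planted direction shows a much larger excess kurtosis than the random one, because a random direction mixes the sparse feature with many Gaussian coordinates and the sum looks closer to Gaussian. ICA-style methods search for directions that maximize such a non-Gaussianity measure rather than checking candidate directions one by one.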