Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

Abstract

We study the problem of achieving theoretically grounded feature recovery with Sparse Autoencoders (SAEs) for interpreting Large Language Models (LLMs). Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. To address these issues, we first propose a novel statistical framework for the feature recovery problem. The framework models polysemantic features as sparse mixtures of underlying monosemantic concepts, which yields a new notion of feature identifiability. Building on this framework, we introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We prove that this algorithm correctly recovers all monosemantic features when the input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and demonstrate that it outperforms benchmark methods when applied to LLMs with up to 1.5 billion parameters. By providing the first SAE algorithm with theoretical recovery guarantees, this work is a foundational step toward demystifying SAE training, and it advances the development of more transparent and trustworthy AI systems through improved mechanistic interpretability.
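
The bias adaptation idea named in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `adapt_biases`, the target activation frequency, the fixed-size sign update, and the synthetic Gaussian inputs are all assumptions made here for illustration. The sketch only shows the general principle of steering each neuron's bias until the neuron fires on roughly a chosen fraction of inputs.

```python
import numpy as np

def adapt_biases(x, W, b, target_freq=0.05, step=0.05, n_rounds=200):
    """Hypothetical bias update: nudge each bias toward a target firing rate.

    x: (n_samples, d) inputs, W: (n_neurons, d) encoder weights,
    b: (n_neurons,) biases. Returns an adjusted copy of b.
    """
    b = b.copy()
    for _ in range(n_rounds):
        pre = x @ W.T + b                # pre-activations, (n_samples, n_neurons)
        freq = (pre > 0).mean(axis=0)    # fraction of inputs on which each ReLU fires
        # Lower the bias of neurons that fire too often, raise it otherwise.
        b -= step * np.sign(freq - target_freq)
    return b

# Synthetic data (an assumption): Gaussian inputs and unit-norm weight rows.
rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 16))
W = rng.normal(size=(32, 16))
W /= np.linalg.norm(W, axis=1, keepdims=True)

b_adapted = adapt_biases(x, W, np.zeros(32))
freq = ((x @ W.T + b_adapted) > 0).mean(axis=0)
print("mean activation frequency:", freq.mean())
```

Because the step size is fixed, each bias ends up oscillating in a narrow band around the value that gives the target frequency. A decaying step would tighten that band, at the cost of one more hyperparameter.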
