Learning a Generative Meta-Model of LLM Activations

Abstract

Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.
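
To make "diffusion loss" on activations concrete, the following is a minimal sketch of a generic noise-prediction (DDPM-style) objective evaluated on a batch of activation vectors. It is not the authors' implementation: the abstract does not specify the objective, noise schedule, or architecture, so the cosine schedule, the function names (`cosine_alpha_bar`, `add_noise`, `diffusion_loss`), the random stand-in data, and the zero-prediction placeholder denoiser are all assumptions made for illustration.

```python
import numpy as np

def cosine_alpha_bar(t):
    """Cumulative signal level for t in [0, 1] (cosine schedule, an assumed choice)."""
    return np.cos(0.5 * np.pi * t) ** 2

def add_noise(x0, t, eps):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a = cosine_alpha_bar(t)[:, None]  # one noise level per row of the batch
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def diffusion_loss(denoiser, x0, rng):
    """Monte Carlo estimate of the noise-prediction loss on a batch of activations."""
    t = rng.uniform(0.0, 1.0, size=x0.shape[0])  # random noise level per sample
    eps = rng.standard_normal(x0.shape)          # target Gaussian noise
    x_t = add_noise(x0, t, eps)                  # noised activations
    eps_hat = denoiser(x_t, t)                   # model's guess at the noise
    return float(np.mean((eps_hat - eps) ** 2))

# Stand-in data: random vectors in place of real residual stream activations.
rng = np.random.default_rng(0)
activations = rng.standard_normal((256, 64))

# Hypothetical placeholder denoiser that always predicts zero noise.
def zero_denoiser(x_t, t):
    return np.zeros_like(x_t)

loss = diffusion_loss(zero_denoiser, activations, rng)
print(f"loss of zero-prediction baseline: {loss:.3f}")
```

Because the target noise has unit variance, the zero-prediction baseline scores close to 1; a trained meta-model would replace `zero_denoiser` with a learned network, and the abstract's claim is that driving this kind of loss down tracks downstream utility.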
