The Birth of Knowledge: Emergent Features across Time, Space, and Scale in Large Language Models

Abstract

This paper studies the emergence of interpretable categorical features within large language models (LLMs), analyzing their behavior across training checkpoints (time), transformer layers (space), and model sizes (scale). Using sparse autoencoders for mechanistic interpretability, we identify when and where specific semantic concepts emerge within neural activations. Results indicate clear temporal and scale-specific thresholds for feature emergence across multiple domains. Notably, spatial analysis reveals unexpected semantic reactivation, with early-layer features re-emerging at later layers, challenging standard assumptions about representational dynamics in transformer models.

