Teach Old SAEs New Domain Tricks with Boosting

Abstract

Sparse Autoencoders (SAEs) have emerged as powerful tools for interpreting the internal representations of Large Language Models (LLMs), yet they often fail to capture domain-specific features that are not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE to model the reconstruction error of a pretrained SAE on domain-specific texts, so that it captures the features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
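
The residual scheme summarized in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the `TinySAE` class, the layer sizes, the hyperparameters, and the synthetic "general" and "domain" activations are all hypothetical stand-ins for real LLM activations and a real pretrained SAE. It also assumes that the secondary SAE reads the same activation `x` as the primary one and is trained to predict the primary's reconstruction error, which the abstract does not spell out.

```python
import numpy as np


class TinySAE:
    """Toy ReLU sparse autoencoder: x_hat = relu(x @ W_enc + b_enc) @ W_dec + b_dec."""

    def __init__(self, d_model, d_hidden, rng):
        self.W_enc = rng.normal(0.0, 0.1, (d_model, d_hidden))
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = rng.normal(0.0, 0.1, (d_hidden, d_model))
        self.b_dec = np.zeros(d_model)

    def __call__(self, x):
        f = np.maximum(x @ self.W_enc + self.b_enc, 0.0)  # feature activations
        return f @ self.W_dec + self.b_dec

    def train(self, x, target, steps=1500, lr=0.01, l1=1e-3):
        """Full-batch gradient descent on mean(||x_hat - target||^2 + l1 * sum(f))."""
        n = x.shape[0]
        for _ in range(steps):
            pre = x @ self.W_enc + self.b_enc
            f = np.maximum(pre, 0.0)
            g_out = 2.0 * (f @ self.W_dec + self.b_dec - target) / n
            # Backpropagate through the decoder, the L1 penalty, and the ReLU.
            g_pre = (g_out @ self.W_dec.T + l1 / n) * (pre > 0)
            self.W_dec -= lr * (f.T @ g_out)
            self.b_dec -= lr * g_out.sum(axis=0)
            self.W_enc -= lr * (x.T @ g_pre)
            self.b_enc -= lr * g_pre.sum(axis=0)


rng = np.random.default_rng(0)
d_model, n = 8, 512


def general_batch():
    # Hypothetical "general" activations: only the first 6 dimensions are used.
    x = np.zeros((n, d_model))
    x[:, :6] = np.abs(rng.normal(size=(n, 6)))
    return x


x_gen = general_batch()
x_dom = general_batch()
# Hypothetical "domain" activations: extra signal in directions unseen in x_gen.
x_dom[:, 6:] = 2.0 * np.abs(rng.normal(size=(n, 2)))

# Step 1: the primary SAE is trained on general data, then left untouched.
primary = TinySAE(d_model, 16, rng)
primary.train(x_gen, x_gen)

# Step 2: the secondary SAE is trained to predict the primary's error on domain data.
residual_target = x_dom - primary(x_dom)
booster = TinySAE(d_model, 8, rng)
booster.train(x_dom, residual_target)

# Step 3: at inference time the two outputs are summed.
boosted = primary(x_dom) + booster(x_dom)
mse_primary = float(np.mean((x_dom - primary(x_dom)) ** 2))
mse_boosted = float(np.mean((x_dom - boosted) ** 2))
print("domain MSE, primary only:", mse_primary)
print("domain MSE, primary + booster:", mse_boosted)
```

Because the primary SAE's weights are never updated in step 2, its own reconstructions are unchanged; the sketch only measures reconstruction error on the toy domain data and does not reproduce the cross-entropy or general-task evaluations reported in the abstract.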
