I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

Abstract

Large Language Models (LLMs) have achieved remarkable success in natural language processing. Recent advances have led to the development of a new class of reasoning LLMs; for example, the open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored. In this work, we employ Sparse Autoencoders (SAEs), a method for learning a sparse decomposition of a neural network's latent representations into interpretable features, to identify the features that drive reasoning in the DeepSeek-R1 series of models. First, we propose an approach to extract candidate "reasoning features" from SAE representations. We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model's reasoning abilities. Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. Code is available at https://github.com/AIRI-Institute/SAE-Reasoning
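
To make the two ideas named in the abstract concrete (decomposing an activation into SAE features, and steering along one feature), here is a minimal sketch. It is not the authors' implementation: the toy dimensions, the random weights standing in for a trained SAE, the ReLU encoder form, and the helper names `sae_encode`, `sae_decode`, and `steer` are all assumptions made for illustration. In a real SAE, sparsity comes from training (for example with an L1 penalty on the feature activations), which this sketch does not do.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes, chosen only for illustration

# Random weights stand in for a trained SAE (assumption, not the paper's weights).
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)


def sae_encode(x):
    """Map an activation vector to non-negative feature activations (ReLU encoder)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def sae_decode(f):
    """Reconstruct the activation as a weighted sum of decoder directions."""
    return f @ W_dec + b_dec


def steer(x, feature_idx, strength):
    """Shift an activation along the decoder direction of a single feature."""
    return x + strength * W_dec[feature_idx]


x = rng.normal(size=d_model)        # hypothetical model activation
features = sae_encode(x)            # feature activations for x
x_hat = sae_decode(features)        # reconstruction of x from its features
x_steered = steer(x, feature_idx=3, strength=2.0)
```

In this sketch, steering leaves the activation unchanged except for a shift of `strength` times one decoder row; the feature index and strength are hypothetical choices, and the abstract itself does not specify how either is selected.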
