I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
March 24, 2025
Authors: Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable success in natural
language processing. Recent advances have led to the development of a new class
of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved
state-of-the-art performance by integrating deep thinking and complex
reasoning. Despite these impressive capabilities, the internal reasoning
mechanisms of such models remain unexplored. In this work, we employ Sparse
Autoencoders (SAEs), a method to learn a sparse decomposition of latent
representations of a neural network into interpretable features, to identify
features that drive reasoning in the DeepSeek-R1 series of models. First, we
propose an approach to extract candidate "reasoning features" from SAE
representations. We validate these features through empirical analysis and
interpretability methods, demonstrating their direct correlation with the
model's reasoning abilities. Crucially, we demonstrate that steering these
features systematically enhances reasoning performance, offering the first
mechanistic account of reasoning in LLMs. Code available at
https://github.com/AIRI-Institute/SAE-Reasoning
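
To make the SAE setup concrete, the following is a minimal sketch of a sparse autoencoder over transformer hidden states, assuming a standard ReLU encoder/decoder trained with a reconstruction loss plus an L1 sparsity penalty; the paper's exact architecture, dictionary width, and training objective may differ.

```python
# Minimal SAE sketch: decompose a layer's hidden states into sparse,
# non-negative feature activations and reconstruct them back.
# Hyperparameters (d_features, l1_coeff) are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # hidden state -> feature activations
        self.decoder = nn.Linear(d_features, d_model)  # feature activations -> reconstruction

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))  # sparse feature activations
        h_hat = self.decoder(f)          # reconstruction of the hidden state
        return h_hat, f

def sae_loss(h, h_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity,
    # so each hidden state is explained by a few interpretable features.
    recon = (h - h_hat).pow(2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity
```

Each column of `decoder.weight` can then be read as the direction in activation space associated with one learned feature, which is what makes the candidate "reasoning features" inspectable and steerable.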
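The steering intervention described in the abstract can be illustrated as adding a feature's decoder direction, scaled by a coefficient, to the residual stream of one layer during generation. The sketch below is hypothetical: the model checkpoint, layer index, steering coefficient, and the random placeholder for the feature direction are assumptions for illustration, not the paper's reported setup.

```python
# Hypothetical feature-steering sketch using a forward hook on one decoder layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # assumption: a DeepSeek-R1 series model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 19                  # assumption: the layer the SAE was trained on
steering_coeff = 4.0            # assumption: intervention strength
# Placeholder direction; in practice this would be sae.decoder.weight[:, feature_id].
feature_direction = torch.randn(model.config.hidden_size)

def steer_hook(module, inputs, output):
    # HF Llama-style decoder layers return a tuple whose first element
    # is the hidden-state tensor; add the scaled feature direction to it.
    if isinstance(output, tuple):
        hidden = output[0]
        steered = hidden + steering_coeff * feature_direction.to(hidden.dtype)
        return (steered,) + output[1:]
    return output + steering_coeff * feature_direction.to(output.dtype)

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
prompt = "Solve: if 3x + 5 = 20, what is x?"
out = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unmodified model
```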