I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
March 24, 2025
Authors: Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable success in natural
language processing. Recent advances have led to the development of a new class
of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved
state-of-the-art performance by integrating deep thinking and complex
reasoning. Despite these impressive capabilities, the internal reasoning
mechanisms of such models remain unexplored. In this work, we employ Sparse
Autoencoders (SAEs), a method to learn a sparse decomposition of latent
representations of a neural network into interpretable features, to identify
features that drive reasoning in the DeepSeek-R1 series of models. First, we
propose an approach to extract candidate "reasoning features" from SAE
representations. We validate these features through empirical analysis and
interpretability methods, demonstrating their direct correlation with the
model's reasoning abilities. Crucially, we demonstrate that steering these
features systematically enhances reasoning performance, offering the first
mechanistic account of reasoning in LLMs. Code available at
https://github.com/AIRI-Institute/SAE-Reasoning
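
To make the SAE setup concrete, the following is a minimal sketch of a sparse autoencoder over transformer hidden states, assuming a standard ReLU encoder/decoder trained with a reconstruction loss plus an L1 sparsity penalty; the paper's exact architecture, dictionary width, and training objective may differ.

```python
# Minimal SAE sketch: decompose a layer's hidden states into sparse,
# non-negative feature activations and reconstruct them back.
# Hyperparameters (d_features, l1_coeff) are illustrative assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # hidden state -> feature activations
        self.decoder = nn.Linear(d_features, d_model)  # feature activations -> reconstruction

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))  # sparse feature activations
        h_hat = self.decoder(f)          # reconstruction of the hidden state
        return h_hat, f

def sae_loss(h, h_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparsity,
    # so each hidden state is explained by a few interpretable features.
    recon = (h - h_hat).pow(2).mean()
    sparsity = f.abs().mean()
    return recon + l1_coeff * sparsity
```

Each column of `decoder.weight` can then be read as the direction in activation space associated with one learned feature, which is what makes the candidate "reasoning features" inspectable and steerable.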
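The steering intervention described in the abstract can be illustrated as adding a feature's decoder direction, scaled by a coefficient, to the residual stream of one layer during generation. The sketch below is hypothetical: the model checkpoint, layer index, steering coefficient, and the random placeholder for the feature direction are assumptions for illustration, not the paper's reported setup.

```python
# Hypothetical feature-steering sketch using a forward hook on one decoder layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # assumption: a DeepSeek-R1 series model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

layer_idx = 19                  # assumption: the layer the SAE was trained on
steering_coeff = 4.0            # assumption: intervention strength
# Placeholder direction; in practice this would be sae.decoder.weight[:, feature_id].
feature_direction = torch.randn(model.config.hidden_size)

def steer_hook(module, inputs, output):
    # HF Llama-style decoder layers return a tuple whose first element
    # is the hidden-state tensor; add the scaled feature direction to it.
    if isinstance(output, tuple):
        hidden = output[0]
        steered = hidden + steering_coeff * feature_direction.to(hidden.dtype)
        return (steered,) + output[1:]
    return output + steering_coeff * feature_direction.to(output.dtype)

handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
prompt = "Solve: if 3x + 5 = 20, what is x?"
out = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore the unmodified model
```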