I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
March 24, 2025
Authors: Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable success in natural
language processing. Recent advances have led to the development of a new class
of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved
state-of-the-art performance by integrating deep thinking and complex
reasoning. Despite these impressive capabilities, the internal reasoning
mechanisms of such models remain unexplored. In this work, we employ Sparse
Autoencoders (SAEs), a method to learn a sparse decomposition of latent
representations of a neural network into interpretable features, to identify
features that drive reasoning in the DeepSeek-R1 series of models. First, we
propose an approach to extract candidate "reasoning features" from SAE
representations. We validate these features through empirical analysis and
interpretability methods, demonstrating their direct correlation with the
model's reasoning abilities. Crucially, we demonstrate that steering these
features systematically enhances reasoning performance, offering the first
mechanistic account of reasoning in LLMs. Code available at
https://github.com/AIRI-Institute/SAE-Reasoning
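The abstract describes two ingredients: a sparse autoencoder that decomposes a model's hidden states into sparse, interpretable features, and "steering", i.e. adding a chosen feature's decoder direction back into the hidden state to influence behavior. The sketch below is a minimal illustration of these ideas, not the authors' implementation; the module names, the dimensions `d_model` and `d_features`, the sparsity penalty, and the steering scale `alpha` are all assumptions made for the example.

```python
# Minimal illustrative sketch (assumptions throughout, not the paper's code):
# (1) an SAE with a ReLU bottleneck wider than the hidden size,
# (2) a steering step that adds one feature's decoder direction to a hidden state.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Sparse autoencoder over a model's hidden states."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))   # sparse feature activations
        h_hat = self.decoder(f)           # reconstruction of the hidden state
        return h_hat, f


def steer(h: torch.Tensor, sae: SparseAutoencoder, feature_idx: int, alpha: float = 4.0):
    """Add alpha times one feature's (normalized) decoder direction to the hidden state."""
    direction = sae.decoder.weight[:, feature_idx]   # shape: (d_model,)
    return h + alpha * direction / direction.norm()


if __name__ == "__main__":
    # Toy dimensions chosen only for illustration.
    sae = SparseAutoencoder(d_model=4096, d_features=65536)
    h = torch.randn(1, 4096)              # stand-in for a residual-stream activation
    h_hat, f = sae(h)
    # Typical SAE training objective: reconstruction error plus an L1 sparsity penalty.
    loss = (h - h_hat).pow(2).mean() + 1e-3 * f.abs().mean()
    h_steered = steer(h, sae, feature_idx=123)
    print(loss.item(), h_steered.shape)
```

In practice such a steering step would be applied inside a forward hook on the layer whose activations the SAE was trained on; the hook placement and scale here are illustrative assumptions.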