I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
March 24, 2025
Authors: Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable success in natural
language processing. Recent advances have led to the development of a new class
of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved
state-of-the-art performance by integrating deep thinking and complex
reasoning. Despite these impressive capabilities, the internal reasoning
mechanisms of such models remain unexplored. In this work, we employ Sparse
Autoencoders (SAEs), a method to learn a sparse decomposition of latent
representations of a neural network into interpretable features, to identify
features that drive reasoning in the DeepSeek-R1 series of models. First, we
propose an approach to extract candidate "reasoning features" from SAE
representations. We validate these features through empirical analysis and
interpretability methods, demonstrating their direct correlation with the
model's reasoning abilities. Crucially, we demonstrate that steering these
features systematically enhances reasoning performance, offering the first
mechanistic account of reasoning in LLMs. Code available at
https://github.com/AIRI-Institute/SAE-Reasoning
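The abstract describes two ingredients: a sparse autoencoder that decomposes a model's hidden states into sparse, interpretable features, and "steering", i.e. adding a chosen feature's decoder direction back into the hidden state to influence behavior. The sketch below is a minimal illustration of these ideas, not the authors' implementation; the module names, the dimensions `d_model` and `d_features`, the sparsity penalty, and the steering scale `alpha` are all assumptions made for the example.

```python
# Minimal illustrative sketch (assumptions throughout, not the paper's code):
# (1) an SAE with a ReLU bottleneck wider than the hidden size,
# (2) a steering step that adds one feature's decoder direction to a hidden state.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Sparse autoencoder over a model's hidden states."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))   # sparse feature activations
        h_hat = self.decoder(f)           # reconstruction of the hidden state
        return h_hat, f


def steer(h: torch.Tensor, sae: SparseAutoencoder, feature_idx: int, alpha: float = 4.0):
    """Add alpha times one feature's (normalized) decoder direction to the hidden state."""
    direction = sae.decoder.weight[:, feature_idx]   # shape: (d_model,)
    return h + alpha * direction / direction.norm()


if __name__ == "__main__":
    # Toy dimensions chosen only for illustration.
    sae = SparseAutoencoder(d_model=4096, d_features=65536)
    h = torch.randn(1, 4096)              # stand-in for a residual-stream activation
    h_hat, f = sae(h)
    # Typical SAE training objective: reconstruction error plus an L1 sparsity penalty.
    loss = (h - h_hat).pow(2).mean() + 1e-3 * f.abs().mean()
    h_steered = steer(h, sae, feature_idx=123)
    print(loss.item(), h_steered.shape)
```

In practice such a steering step would be applied inside a forward hook on the layer whose activations the SAE was trained on; the hook placement and scale here are illustrative assumptions.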