Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
January 20, 2026
Authors: Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
cs.AI
Abstract
Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.