Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

January 20, 2026
Authors: Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong
cs.AI

Abstract

Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.