Open Problems in Mechanistic Interpretability
January 27, 2025
Authors: Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath
cs.AI
Abstract
Mechanistic interpretability aims to understand the computational mechanisms
underlying neural networks' capabilities in order to accomplish concrete
scientific and engineering goals. Progress in this field thus promises to
provide greater assurance over AI system behavior and shed light on exciting
scientific questions about the nature of intelligence. Despite recent progress
toward these goals, there are many open problems in the field that require
solutions before many scientific and practical benefits can be realized: Our
methods require both conceptual and practical improvements to reveal deeper
insights; we must figure out how best to apply our methods in pursuit of
specific goals; and the field must grapple with socio-technical challenges that
influence and are influenced by our work. This forward-facing review discusses
the current frontier of mechanistic interpretability and the open problems that
the field may benefit from prioritizing.