Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
January 11, 2024
Authors: Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
cs.AI
Abstract
Inspecting the information encoded in hidden representations of large language models (LLMs) can explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of research questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as special instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by a Patchscope. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities, such as using a more capable model to explain the representations of a smaller model, and unlocks new applications, such as self-correction in multi-hop reasoning.
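
To make the idea concrete, the sketch below implements a simple Patchscope in the spirit the abstract describes: a hidden representation is read from one forward pass, patched into the last position of a few-shot "identity" inspection prompt, and the model's continuation is taken as a natural-language reading of that representation. This is a minimal sketch, assuming a HuggingFace GPT-2 checkpoint; the prompts, layer indices, and the patch_hook helper are illustrative choices, not the paper's exact configuration.

```python
# Minimal Patchscope-style sketch (illustrative, not the paper's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with accessible blocks works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Source pass: the representation we want to inspect.
source_prompt = "The Eiffel Tower is located in the city of"
source_layer = 6  # which hidden layer to read from (assumption)

with torch.no_grad():
    src_out = model(**tok(source_prompt, return_tensors="pt"),
                    output_hidden_states=True)
# hidden_states[0] is the embedding output, so index L is the output of block L.
patch_vec = src_out.hidden_states[source_layer][0, -1].clone()

# Target pass: a few-shot "identity" prompt; the final token is the slot
# whose hidden state we overwrite with the source representation.
target_prompt = "cat -> cat; 1135 -> 1135; hello -> hello; x ->"
target_layer = 6  # which layer to patch into (assumption)
target_inputs = tok(target_prompt, return_tensors="pt")
patch_pos = target_inputs["input_ids"].shape[1] - 1  # last prompt position

def patch_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > patch_pos:          # only when the slot is present
        hidden[0, patch_pos] = patch_vec     # overwrite in place
    return output

# Block index L-1 produces hidden_states[L], matching the source indexing.
handle = model.transformer.h[target_layer - 1].register_forward_hook(patch_hook)
try:
    with torch.no_grad():
        gen = model.generate(**target_inputs, max_new_tokens=5,
                             do_sample=False, use_cache=False,
                             pad_token_id=tok.eos_token_id)
finally:
    handle.remove()

# The continuation is the model's verbalization of the patched representation.
print(tok.decode(gen[0][target_inputs["input_ids"].shape[1]:]))
```

Varying the source and target models, layers, positions, and the inspection prompt yields different members of the framework, which is how the abstract's earlier vocabulary-projection and intervention methods can be read as special cases.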