Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
January 11, 2024
Authors: Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
cs.AI
Abstract
Inspecting the information encoded in hidden representations of large language models (LLMs) can explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of research questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as special instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by a Patchscope. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities, such as using a more capable model to explain the representations of a smaller model, and unlocks new applications, such as self-correction in multi-hop reasoning.
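
To make the idea concrete, the sketch below implements a simple Patchscope in the spirit the abstract describes: a hidden representation is read from one forward pass, patched into the last position of a few-shot "identity" inspection prompt, and the model's continuation is taken as a natural-language reading of that representation. This is a minimal sketch, assuming a HuggingFace GPT-2 checkpoint; the prompts, layer indices, and the patch_hook helper are illustrative choices, not the paper's exact configuration.

```python
# Minimal Patchscope-style sketch (illustrative, not the paper's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with accessible blocks works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Source pass: the representation we want to inspect.
source_prompt = "The Eiffel Tower is located in the city of"
source_layer = 6  # which hidden layer to read from (assumption)

with torch.no_grad():
    src_out = model(**tok(source_prompt, return_tensors="pt"),
                    output_hidden_states=True)
# hidden_states[0] is the embedding output, so index L is the output of block L.
patch_vec = src_out.hidden_states[source_layer][0, -1].clone()

# Target pass: a few-shot "identity" prompt; the final token is the slot
# whose hidden state we overwrite with the source representation.
target_prompt = "cat -> cat; 1135 -> 1135; hello -> hello; x ->"
target_layer = 6  # which layer to patch into (assumption)
target_inputs = tok(target_prompt, return_tensors="pt")
patch_pos = target_inputs["input_ids"].shape[1] - 1  # last prompt position

def patch_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > patch_pos:          # only when the slot is present
        hidden[0, patch_pos] = patch_vec     # overwrite in place
    return output

# Block index L-1 produces hidden_states[L], matching the source indexing.
handle = model.transformer.h[target_layer - 1].register_forward_hook(patch_hook)
try:
    with torch.no_grad():
        gen = model.generate(**target_inputs, max_new_tokens=5,
                             do_sample=False, use_cache=False,
                             pad_token_id=tok.eos_token_id)
finally:
    handle.remove()

# The continuation is the model's verbalization of the patched representation.
print(tok.decode(gen[0][target_inputs["input_ids"].shape[1]:]))
```

Varying the source and target models, layers, positions, and the inspection prompt yields different members of the framework, which is how the abstract's earlier vocabulary-projection and intervention methods can be read as special cases.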