Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
January 11, 2024
Authors: Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
cs.AI
Abstract
Inspecting the information encoded in hidden representations of large language models (LLMs) can explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of research questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as special instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by a Patchscope. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities, such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.
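To make the core mechanism concrete, below is a minimal sketch of a Patchscope-style inspection, not the authors' reference implementation: a hidden representation is extracted from a source prompt at a chosen layer and position, patched into a position of a separate target prompt, and the model's continuation serves as a natural-language read-out of what that representation encodes. The model name, prompts, layer indices, and the "x" placeholder target prompt are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices (assumptions): any decoder-only HF model works similarly.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

source_prompt = "Amazing Grace was written by"   # prompt whose hidden state we inspect
source_layer, source_pos = 6, -1                 # which layer/position to read from
target_prompt = "Tell me about x"                # hypothetical inspection prompt; "x" is a placeholder token
target_layer, target_pos = 6, -1                 # where to inject the representation

# 1) Run the source prompt and grab the hidden state at (layer, position).
#    hidden_states[0] is the embedding output, so index source_layer + 1
#    is the output of transformer block `source_layer`.
with torch.no_grad():
    src_out = model(**tok(source_prompt, return_tensors="pt"))
h = src_out.hidden_states[source_layer + 1][0, source_pos].clone()

# 2) Hook the target block so its output at `target_pos` is overwritten
#    with h on the prompt (prefill) pass only.
def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:          # skip the one-token steps during generation
        hidden[0, target_pos] = h
    return output

block = model.transformer.h[target_layer]        # GPT-2 module layout; other models differ
handle = block.register_forward_hook(patch_hook)

# 3) Generate from the patched target prompt; the continuation verbalizes
#    whatever information the patched representation carries.
with torch.no_grad():
    gen = model.generate(**tok(target_prompt, return_tensors="pt"),
                         max_new_tokens=10, do_sample=False)
handle.remove()
print(tok.decode(gen[0], skip_special_tokens=True))
```

In this framing, varying the target prompt, the source and target layers, or even decoding with a different (possibly larger) model are the degrees of freedom that let one Patchscope answer different inspection questions.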