Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
January 11, 2024
Authors: Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, Mor Geva
cs.AI
Abstract
Inspecting the information encoded in hidden representations of large language models (LLMs) can explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of research questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as special instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by a Patchscope. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities, such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.
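To make the core mechanism concrete, below is a minimal sketch of a Patchscope-style inspection, not the authors' reference implementation: a hidden representation is extracted from a source prompt at a chosen layer and position, patched into a position of a separate target prompt, and the model's continuation serves as a natural-language read-out of what that representation encodes. The model name, prompts, layer indices, and the "x" placeholder target prompt are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices (assumptions): any decoder-only HF model works similarly.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

source_prompt = "Amazing Grace was written by"   # prompt whose hidden state we inspect
source_layer, source_pos = 6, -1                 # which layer/position to read from
target_prompt = "Tell me about x"                # hypothetical inspection prompt; "x" is a placeholder token
target_layer, target_pos = 6, -1                 # where to inject the representation

# 1) Run the source prompt and grab the hidden state at (layer, position).
#    hidden_states[0] is the embedding output, so index source_layer + 1
#    is the output of transformer block `source_layer`.
with torch.no_grad():
    src_out = model(**tok(source_prompt, return_tensors="pt"))
h = src_out.hidden_states[source_layer + 1][0, source_pos].clone()

# 2) Hook the target block so its output at `target_pos` is overwritten
#    with h on the prompt (prefill) pass only.
def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:          # skip the one-token steps during generation
        hidden[0, target_pos] = h
    return output

block = model.transformer.h[target_layer]        # GPT-2 module layout; other models differ
handle = block.register_forward_hook(patch_hook)

# 3) Generate from the patched target prompt; the continuation verbalizes
#    whatever information the patched representation carries.
with torch.no_grad():
    gen = model.generate(**tok(target_prompt, return_tensors="pt"),
                         max_new_tokens=10, do_sample=False)
handle.remove()
print(tok.decode(gen[0], skip_special_tokens=True))
```

In this framing, varying the target prompt, the source and target layers, or even decoding with a different (possibly larger) model are the degrees of freedom that let one Patchscope answer different inspection questions.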