

Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures

September 29, 2025
Authors: Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini
cs.AI

Abstract

Despite their capabilities, Large Language Models (LLMs) remain opaque, and their internal representations are poorly understood. Current interpretability methods, such as direct logit attribution (DLA) and sparse autoencoders (SAEs), offer limited insight, constrained by the model's output vocabulary or by unclear feature names. This work introduces the Hyperdimensional Probe, a novel paradigm for decoding information from the LLM vector space. It combines ideas from symbolic representations and neural probing to project the model's residual stream into interpretable concepts via Vector Symbolic Architectures (VSAs). This probe combines the strengths of SAEs and conventional probes while overcoming their key limitations. We validate our decoding paradigm with controlled input-completion tasks, probing the model's final state before next-token prediction on inputs spanning syntactic pattern recognition, key-value associations, and abstract inference. We further assess it in a question-answering setting, examining the state of the model both before and after text generation. Our experiments show that our probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, while also helping to identify LLM failures. Our work advances information decoding in the LLM vector space, enabling the extraction of more informative, interpretable, and structured features from neural representations.
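To make the VSA machinery behind this kind of probe concrete, below is a minimal, self-contained sketch of the core operations (binding, bundling, and similarity-based cleanup) in a MAP-style architecture with bipolar hypervectors. This is a generic VSA illustration under simplified assumptions, not the paper's actual probe: the dimensionality, the concept codebook, and the encoding of a model's residual stream into VSA space are all hypothetical choices made here for demonstration.

```python
# Minimal MAP-style VSA sketch: bind role/filler pairs, bundle them into a
# single hypervector, then decode a filler by unbinding and cleanup.
# All names and parameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality (assumption; high D gives quasi-orthogonality)

def random_hv():
    """Random bipolar hypervector; random high-D vectors are nearly orthogonal."""
    return rng.choice([-1.0, 1.0], size=D)

def bind(a, b):
    """Bind two hypervectors (elementwise product; self-inverse in MAP)."""
    return a * b

def bundle(*hvs):
    """Superpose several bound pairs into one composite representation."""
    return np.sign(np.sum(hvs, axis=0))

def cleanup(query, codebook):
    """Decode: return the codebook concept most similar to the query."""
    sims = {name: np.dot(query, hv) / D for name, hv in codebook.items()}
    return max(sims, key=sims.get)

# Hypothetical concept codebook of role and filler atoms.
codebook = {name: random_hv() for name in
            ["subject", "verb", "cat", "sleeps", "dog", "runs"]}

# Encode a key-value structure: {subject: cat, verb: sleeps}.
state = bundle(bind(codebook["subject"], codebook["cat"]),
               bind(codebook["verb"], codebook["sleeps"]))

# Unbind the "subject" role and clean up against the filler atoms.
fillers = {k: codebook[k] for k in ["cat", "sleeps", "dog", "runs"]}
print(cleanup(bind(state, codebook["subject"]), fillers))  # -> "cat"
```

In a probing setting, the composite `state` would come from (a learned mapping of) the model's residual stream rather than being constructed by hand; the cleanup step against a concept codebook is what yields named, interpretable features instead of anonymous directions.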
PDF · October 2, 2025